--- name: torchdrug description: "Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs." --- # TorchDrug ## Overview TorchDrug is a comprehensive PyTorch-based machine learning toolbox for drug discovery and molecular science. Apply graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs, including molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis planning, with 40+ curated datasets and 20+ model architectures. ## When to Use This Skill This skill should be used when working with: **Data Types:** - SMILES strings or molecular structures - Protein sequences or 3D structures (PDB files) - Chemical reactions and retrosynthesis - Biomedical knowledge graphs - Drug discovery datasets **Tasks:** - Predicting molecular properties (solubility, toxicity, activity) - Protein function or structure prediction - Drug-target binding prediction - Generating new molecular structures - Planning chemical synthesis routes - Link prediction in biomedical knowledge bases - Training graph neural networks on scientific data **Libraries and Integration:** - TorchDrug is the primary library - Often used with RDKit for cheminformatics - Compatible with PyTorch and PyTorch Lightning - Integrates with AlphaFold and ESM for proteins ## Getting Started ### Installation ```bash pip install torchdrug # Or with optional dependencies pip install torchdrug[full] ``` ### Quick Example ```python from torchdrug import datasets, models, tasks from torch.utils.data import DataLoader # Load molecular dataset dataset = datasets.BBBP("~/molecule-datasets/") train_set, valid_set, test_set = dataset.split() # Define GNN model model = models.GIN( input_dim=dataset.node_feature_dim, hidden_dims=[256, 256, 256], edge_input_dim=dataset.edge_feature_dim, batch_norm=True, readout="mean" ) # Create property prediction task task = tasks.PropertyPrediction( model, task=dataset.tasks, criterion="bce", metric=["auroc", "auprc"] ) # Train with PyTorch optimizer = torch.optim.Adam(task.parameters(), lr=1e-3) train_loader = DataLoader(train_set, batch_size=32, shuffle=True) for epoch in range(100): for batch in train_loader: loss = task(batch) optimizer.zero_grad() loss.backward() optimizer.step() ``` ## Core Capabilities ### 1. Molecular Property Prediction Predict chemical, physical, and biological properties of molecules from structure. **Use Cases:** - Drug-likeness and ADMET properties - Toxicity screening - Quantum chemistry properties - Binding affinity prediction **Key Components:** - 20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.) - GNN models (GIN, GAT, SchNet) - PropertyPrediction and MultipleBinaryClassification tasks **Reference:** See `references/molecular_property_prediction.md` for: - Complete dataset catalog - Model selection guide - Training workflows and best practices - Feature engineering details ### 2. Protein Modeling Work with protein sequences, structures, and properties. **Use Cases:** - Enzyme function prediction - Protein stability and solubility - Subcellular localization - Protein-protein interactions - Structure prediction **Key Components:** - 15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.) - Sequence models (ESM, ProteinBERT, ProteinLSTM) - Structure models (GearNet, SchNet) - Multiple task types for different prediction levels **Reference:** See `references/protein_modeling.md` for: - Protein-specific datasets - Sequence vs structure models - Pre-training strategies - Integration with AlphaFold and ESM ### 3. Knowledge Graph Reasoning Predict missing links and relationships in biological knowledge graphs. **Use Cases:** - Drug repurposing - Disease mechanism discovery - Gene-disease associations - Multi-hop biomedical reasoning **Key Components:** - General KGs (FB15k, WN18) and biomedical (Hetionet) - Embedding models (TransE, RotatE, ComplEx) - KnowledgeGraphCompletion task **Reference:** See `references/knowledge_graphs.md` for: - Knowledge graph datasets (including Hetionet with 45k biomedical entities) - Embedding model comparison - Evaluation metrics and protocols - Biomedical applications ### 4. Molecular Generation Generate novel molecular structures with desired properties. **Use Cases:** - De novo drug design - Lead optimization - Chemical space exploration - Property-guided generation **Key Components:** - Autoregressive generation - GCPN (policy-based generation) - GraphAutoregressiveFlow - Property optimization workflows **Reference:** See `references/molecular_generation.md` for: - Generation strategies (unconditional, conditional, scaffold-based) - Multi-objective optimization - Validation and filtering - Integration with property prediction ### 5. Retrosynthesis Predict synthetic routes from target molecules to starting materials. **Use Cases:** - Synthesis planning - Route optimization - Synthetic accessibility assessment - Multi-step planning **Key Components:** - USPTO-50k reaction dataset - CenterIdentification (reaction center prediction) - SynthonCompletion (reactant prediction) - End-to-end Retrosynthesis pipeline **Reference:** See `references/retrosynthesis.md` for: - Task decomposition (center ID → synthon completion) - Multi-step synthesis planning - Commercial availability checking - Integration with other retrosynthesis tools ### 6. Graph Neural Network Models Comprehensive catalog of GNN architectures for different data types and tasks. **Available Models:** - General GNNs: GCN, GAT, GIN, RGCN, MPNN - 3D-aware: SchNet, GearNet - Protein-specific: ESM, ProteinBERT, GearNet - Knowledge graph: TransE, RotatE, ComplEx, SimplE - Generative: GraphAutoregressiveFlow **Reference:** See `references/models_architectures.md` for: - Detailed model descriptions - Model selection guide by task and dataset - Architecture comparisons - Implementation tips ### 7. Datasets 40+ curated datasets spanning chemistry, biology, and knowledge graphs. **Categories:** - Molecular properties (drug discovery, quantum chemistry) - Protein properties (function, structure, interactions) - Knowledge graphs (general and biomedical) - Retrosynthesis reactions **Reference:** See `references/datasets.md` for: - Complete dataset catalog with sizes and tasks - Dataset selection guide - Loading and preprocessing - Splitting strategies (random, scaffold) ## Common Workflows ### Workflow 1: Molecular Property Prediction **Scenario:** Predict blood-brain barrier penetration for drug candidates. **Steps:** 1. Load dataset: `datasets.BBBP()` 2. Choose model: GIN for molecular graphs 3. Define task: `PropertyPrediction` with binary classification 4. Train with scaffold split for realistic evaluation 5. Evaluate using AUROC and AUPRC **Navigation:** `references/molecular_property_prediction.md` → Dataset selection → Model selection → Training ### Workflow 2: Protein Function Prediction **Scenario:** Predict enzyme function from sequence. **Steps:** 1. Load dataset: `datasets.EnzymeCommission()` 2. Choose model: ESM (pre-trained) or GearNet (with structure) 3. Define task: `PropertyPrediction` with multi-class classification 4. Fine-tune pre-trained model or train from scratch 5. Evaluate using accuracy and per-class metrics **Navigation:** `references/protein_modeling.md` → Model selection (sequence vs structure) → Pre-training strategies ### Workflow 3: Drug Repurposing via Knowledge Graphs **Scenario:** Find new disease treatments in Hetionet. **Steps:** 1. Load dataset: `datasets.Hetionet()` 2. Choose model: RotatE or ComplEx 3. Define task: `KnowledgeGraphCompletion` 4. Train with negative sampling 5. Query for "Compound-treats-Disease" predictions 6. Filter by plausibility and mechanism **Navigation:** `references/knowledge_graphs.md` → Hetionet dataset → Model selection → Biomedical applications ### Workflow 4: De Novo Molecule Generation **Scenario:** Generate drug-like molecules optimized for target binding. **Steps:** 1. Train property predictor on activity data 2. Choose generation approach: GCPN for RL-based optimization 3. Define reward function combining affinity, drug-likeness, synthesizability 4. Generate candidates with property constraints 5. Validate chemistry and filter by drug-likeness 6. Rank by multi-objective scoring **Navigation:** `references/molecular_generation.md` → Conditional generation → Multi-objective optimization ### Workflow 5: Retrosynthesis Planning **Scenario:** Plan synthesis route for target molecule. **Steps:** 1. Load dataset: `datasets.USPTO50k()` 2. Train center identification model (RGCN) 3. Train synthon completion model (GIN) 4. Combine into end-to-end retrosynthesis pipeline 5. Apply recursively for multi-step planning 6. Check commercial availability of building blocks **Navigation:** `references/retrosynthesis.md` → Task types → Multi-step planning ## Integration Patterns ### With RDKit Convert between TorchDrug molecules and RDKit: ```python from torchdrug import data from rdkit import Chem # SMILES → TorchDrug molecule smiles = "CCO" mol = data.Molecule.from_smiles(smiles) # TorchDrug → RDKit rdkit_mol = mol.to_molecule() # RDKit → TorchDrug rdkit_mol = Chem.MolFromSmiles(smiles) mol = data.Molecule.from_molecule(rdkit_mol) ``` ### With AlphaFold/ESM Use predicted structures: ```python from torchdrug import data # Load AlphaFold predicted structure protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb") # Build graph with spatial edges graph = protein.residue_graph( node_position="ca", edge_types=["sequential", "radius"], radius_cutoff=10.0 ) ``` ### With PyTorch Lightning Wrap tasks for Lightning training: ```python import pytorch_lightning as pl class LightningTask(pl.LightningModule): def __init__(self, torchdrug_task): super().__init__() self.task = torchdrug_task def training_step(self, batch, batch_idx): return self.task(batch) def validation_step(self, batch, batch_idx): pred = self.task.predict(batch) target = self.task.target(batch) return {"pred": pred, "target": target} def configure_optimizers(self): return torch.optim.Adam(self.parameters(), lr=1e-3) ``` ## Technical Details For deep dives into TorchDrug's architecture: **Core Concepts:** See `references/core_concepts.md` for: - Architecture philosophy (modular, configurable) - Data structures (Graph, Molecule, Protein, PackedGraph) - Model interface and forward function signature - Task interface (predict, target, forward, evaluate) - Training workflows and best practices - Loss functions and metrics - Common pitfalls and debugging ## Quick Reference Cheat Sheet **Choose Dataset:** - Molecular property → `references/datasets.md` → Molecular section - Protein task → `references/datasets.md` → Protein section - Knowledge graph → `references/datasets.md` → Knowledge graph section **Choose Model:** - Molecules → `references/models_architectures.md` → GNN section → GIN/GAT/SchNet - Proteins (sequence) → `references/models_architectures.md` → Protein section → ESM - Proteins (structure) → `references/models_architectures.md` → Protein section → GearNet - Knowledge graph → `references/models_architectures.md` → KG section → RotatE/ComplEx **Common Tasks:** - Property prediction → `references/molecular_property_prediction.md` or `references/protein_modeling.md` - Generation → `references/molecular_generation.md` - Retrosynthesis → `references/retrosynthesis.md` - KG reasoning → `references/knowledge_graphs.md` **Understand Architecture:** - Data structures → `references/core_concepts.md` → Data Structures - Model design → `references/core_concepts.md` → Model Interface - Task design → `references/core_concepts.md` → Task Interface ## Troubleshooting Common Issues **Issue: Dimension mismatch errors** → Check `model.input_dim` matches `dataset.node_feature_dim` → See `references/core_concepts.md` → Essential Attributes **Issue: Poor performance on molecular tasks** → Use scaffold splitting, not random → Try GIN instead of GCN → See `references/molecular_property_prediction.md` → Best Practices **Issue: Protein model not learning** → Use pre-trained ESM for sequence tasks → Check edge construction for structure models → See `references/protein_modeling.md` → Training Workflows **Issue: Memory errors with large graphs** → Reduce batch size → Use gradient accumulation → See `references/core_concepts.md` → Memory Efficiency **Issue: Generated molecules are invalid** → Add validity constraints → Post-process with RDKit validation → See `references/molecular_generation.md` → Validation and Filtering ## Resources **Official Documentation:** https://torchdrug.ai/docs/ **GitHub:** https://github.com/DeepGraphLearning/torchdrug **Paper:** TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery ## Summary Navigate to the appropriate reference file based on your task: 1. **Molecular property prediction** → `molecular_property_prediction.md` 2. **Protein modeling** → `protein_modeling.md` 3. **Knowledge graphs** → `knowledge_graphs.md` 4. **Molecular generation** → `molecular_generation.md` 5. **Retrosynthesis** → `retrosynthesis.md` 6. **Model selection** → `models_architectures.md` 7. **Dataset selection** → `datasets.md` 8. **Technical details** → `core_concepts.md` Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.