--- name: sar-analysis description: Structure-activity relationship (SAR) analysis guide for drug discovery including molecular descriptor analysis, scaffold analysis, and activity cliff detection. license: open --- # SAR Analysis --- ## Metadata **Short Description**: Comprehensive guide for performing Structure-Activity Relationship (SAR) analysis using RDKit. **Authors**: Ohagent Team **Version**: 1.0 **Last Updated**: December 2025 **License**: CC BY 4.0 **Commercial Use**: ✅ Allowed --- ## Overview Structure-Activity Relationship (SAR) analysis is a core medicinal-chemistry workflow that relates systematic structural variations of a chemical series to changes in biological activity. The goal is to (1) identify a common scaffold (Maximum Common Substructure, MCS) shared by a series of analogues, (2) decompose each molecule into the scaffold plus its R-group substituents, (3) align all molecules so substituents at equivalent positions are visually comparable, and (4) connect substituent variation to potency to derive testable design hypotheses. This guide formalizes a reproducible RDKit-based SAR workflow that produces an interactive HTML report (compound table with aligned core/R-groups and an activity heatmap) and a written SAR narrative that explicitly contrasts substituents at the same R-position. It is intended for use on activity tables containing SMILES, a compound identifier, and a numeric potency value (IC50, Ki, EC50, %inhibition, etc.). ## Key Concepts ### Maximum Common Substructure (MCS) MCS is the largest connected substructure shared by all (or a configurable threshold of) molecules in a set. RDKit's `rdFMCS.FindMCS` searches for this scaffold under tunable atom/bond comparison rules. For SAR, MCS provides the anchor template against which every analogue is decomposed and aligned. A `threshold=0.8` allows MCS to be defined when only 80% of molecules contain the candidate substructure, which is more robust to outliers than `threshold=1.0`. `ringMatchesRingOnly=True` and `completeRingsOnly=True` prevent partial-ring fragments that look chemically meaningless. ### R-Group Decomposition R-group decomposition (`rdRGroupDecomposition.RGroupDecompose`) maps each molecule onto the MCS core and assigns the non-core fragments to enumerated R-positions (R1, R2, …). The output is a per-molecule dictionary `{Core, R1, R2, …}`. Constant R-positions (where every molecule carries the same fragment) are uninformative for SAR and should be pruned from the report so attention focuses on the variable positions that actually drive activity. ### Substructure Alignment for Comparable 2D Depiction For SAR visualization to be interpretable, the core and each R-group must be drawn in the same orientation as the parent molecule. The recommended pattern uses three fall-back strategies in order: (1) a direct `GetSubstructMatch`, (2) a re-match after `AdjustQueryProperties(makeDummiesQueries=True)` so R-group dummy atoms are treated as queries, and (3) a final attempt with `useChirality=False`. Once a match is found, atom coordinates are copied from the parent conformer onto the fragment. Without this, R-group cells are drawn in arbitrary canonical orientations and visual SAR is essentially impossible to read. ### Activity Heatmap and Comparative Analysis A logarithmic-scale color gradient (green = high potency / low IC50, red = low potency / high IC50) on the activity column lets a reader spot trends across the series at a glance. The accompanying narrative must justify every claim about a substituent's effect by *explicit pairwise contrast at the same R-position* — the unit of SAR evidence is "compound A (R1=X, IC50=…) vs compound B (R1=Y, IC50=…)", never an unsupported generalization. ## Decision Framework ``` SAR analysis pipeline └── Have SMILES + activity for >= 4 analogues? ├── No -> Insufficient data; collect more analogues first └── Yes -> Run rdFMCS.FindMCS(threshold=0.8, ringMatchesRingOnly=True) ├── MCS too small (<5 atoms) -> Series is too diverse; │ cluster first, then run SAR per cluster └── MCS reasonable -> RGroupDecompose └── For each fragment alignment to parent: ├── Strategy 1: GetSubstructMatch(direct) -- works for canonical cases │ └── No match -> Strategy 2 ├── Strategy 2: AdjustQueryProperties(makeDummiesQueries=True) │ -- handles dummy R-group atoms │ └── No match -> Strategy 3 ├── Strategy 3: GetSubstructMatch(useChirality=False) │ -- handles stereochemistry mismatches │ └── No match -> Compute2DCoords as fallback (lose alignment) └── Drop constant R-positions, build HTML, draw with DrawMoleculeACS1996 ``` | Situation | Recommended choice | Rationale | |-----------|--------------------|-----------| | Standard congeneric series with a clear scaffold | MCS `threshold=0.8`, `ringMatchesRingOnly=True`, `completeRingsOnly=True` | Tolerates a small minority of outliers while keeping rings intact | | Highly diverse set (e.g., HTS hit list) | Cluster (Tanimoto/Murcko) first, then SAR per cluster | A single MCS will be too small to be useful across diverse chemotypes | | Stereoisomers in the series | Try Strategy 1 first; fall back to Strategy 3 (`useChirality=False`) | Chirality differences should not break depiction alignment | | Analogues with R-group attachment dummies in queries | Strategy 2 with `AdjustQueryProperties(makeDummiesQueries=True)` | Dummy atoms are treated as queries so they match real heavy atoms | | One R-position constant across all analogues | Drop from report and from core depiction | Constant positions are uninformative and clutter the table | | Activity spans many orders of magnitude | Color heatmap on `log10(activity)` | Linear scale collapses the dynamic range visually | | Drawing for publication or report | `DrawMoleculeACS1996` via `MolDraw2DSVG` | ACS1996 is the de facto standard for medicinal chemistry figures | ## Best Practices 1. **Inspect the dataframe before assuming column names.** Real-world activity tables vary; auto-detect the SMILES, activity, and ID columns from `df.head()` rather than hard-coding names. This avoids silent failures on user data. 2. **Add explicit hydrogens before MCS.** `Chem.AddHs` lets MCS reason correctly about heavy-atom valence and ring closures; without it, otherwise-identical scaffolds can be missed. 3. **Prune constant R-positions.** Any R-position whose fragment is identical across every analogue contributes no SAR information; remove that column from the table and remove the constant attachment point from the core depiction so the variable positions stand out. 4. **Always align fragments to the parent molecule, not the other way around.** Copy coordinates *from the parent* onto each fragment via the matched atom map. Drawing the parent canonically and then re-drawing fragments from scratch loses comparability between rows. 5. **Use a log-scale activity heatmap.** Potency typically spans 2–4 orders of magnitude; a linear color scale collapses the interesting low-IC50 region. Map green to low IC50 (high potency) and red to high IC50 (low potency). 6. **Justify every SAR claim with a pairwise contrast at the same R-position.** Statements like "small electron-withdrawing groups improve activity at R1" must be backed by a direct comparison such as "compound 7 (R1=F, IC50=0.5 µM) vs compound 1 (R1=Me, IC50=5.2 µM)". Unsupported generalizations are not acceptable evidence. 7. **Test 3-4 analogues per design hypothesis.** A single substitution change can be confounded by experimental noise; multiple analogues at the same position give a defensible trend. 8. **Render with `DrawMoleculeACS1996`.** ACS1996 styling produces consistent bond lengths, atom labels, and font choices that match medicinal-chemistry publication norms; avoid mixing styles within a single report. ## Common Pitfalls - **Pitfall: Hard-coding column names like `"SMILES"` or `"IC50"`.** Different vendors and ELNs export different headers; the script breaks on the first user that uses `Smiles` or `Standard Value`. - *How to avoid*: Inspect `df.columns` and `df.head()` and detect the SMILES/activity/ID columns by content (valid SMILES parse rate, numeric values, unique strings). - **Pitfall: Skipping `Chem.AddHs` before MCS.** Implicit-H molecules can yield a smaller-than-expected MCS because valence and ring perception differ. - *How to avoid*: Always preprocess with `mols_for_mcs = [Chem.AddHs(m) for m in mols]` before calling `rdFMCS.FindMCS`. - **Pitfall: Setting `threshold=1.0` on a noisy series.** A single outlier with an unusual scaffold collapses the MCS to a tiny fragment and ruins R-group decomposition for everyone else. - *How to avoid*: Use `threshold=0.8` (or lower) so the MCS is defined when 80% of the series contains it; review the outlier(s) separately. - **Pitfall: Drawing each fragment with `Compute2DCoords` independently.** Each fragment receives its own canonical 2D layout, so equivalent atoms appear in different positions across rows and visual SAR becomes unreadable. - *How to avoid*: Match each fragment to the parent (with the 3-strategy fallback) and copy coordinates from the parent's conformer onto the fragment's conformer. - **Pitfall: Failing on dummy R-group atoms.** `GetSubstructMatch` returns no match when the fragment contains R-group dummy atoms (`*`) because dummies are not treated as queries by default. - *How to avoid*: Apply `AdjustQueryProperties(params)` with `makeDummiesQueries=True` before retrying the match (Strategy 2). - **Pitfall: Reporting only a single error metric (e.g., mean only).** A trend reported without dispersion is not interpretable; equally, claims about substituent effects without same-position contrasts are not SAR. - *How to avoid*: For every R-position, list each unique substituent and the activities of the compounds carrying it; derive every claim from a pairwise comparison. - **Pitfall: Using a linear-scale heatmap on IC50 in nM.** Most of the interesting potency range collapses into one or two color bins. - *How to avoid*: Color by `log10(IC50)` or `pIC50 = -log10(IC50_in_M)`; this gives uniform color separation across orders of magnitude. - **Pitfall: Treating every column of the R-group decomposition as a SAR axis.** Constant R-positions (every analogue has the same fragment) and the core itself are not SAR variables. - *How to avoid*: After decomposition, programmatically drop columns where every entry is identical and remove those attachment points from the core image. ## Workflow You are an expert in Cheminformatics and Python. Perform a SAR (Structure-Activity Relationship) analysis using RDKit. **Task Requirements:** 1. **Data Loading:** Load the CSV file. Do not assume fixed column names. Instead, inspect the dataframe (e.g., using `df.head()`) to automatically identify columns for Compound Key (e.g., 'Compound Key', 'ID', 'Name'), Activity (e.g., 'Standard Value', 'IC50', 'Activity'), and SMILES (e.g., 'Smiles', 'SMILES', 'Structure'). 2. **Core Identification (MCS):** * Use `rdFMCS.FindMCS` to find a significant common scaffold. * **Pre-processing:** Apply `Chem.AddHs` to molecules before finding MCS. * **Reference Code:** Use the following parameter settings for robust core identification: ```python mols_for_mcs = [Chem.AddHs(m) for m in mols] mcs_res = rdFMCS.FindMCS( mols_for_mcs, threshold=0.8, ringMatchesRingOnly=True, completeRingsOnly=True, atomCompare=rdFMCS.AtomCompare.CompareElements, bondCompare=rdFMCS.BondCompare.CompareOrder ) core_mol = Chem.MolFromSmarts(mcs_res.smartsString) AllChem.Compute2DCoords(core_mol) ``` 3. **R-Group Decomposition & Refinement:** * Perform decomposition based on the Core. * **Refinement:** Exclude any R-group columns that are identical (constant) across all molecules. Remove these constant points from the Core visualization as well. 4. **Image Generation & Alignment (Strict Coordinate Extraction):** * **Goal:** Ensure Core and R-groups are visually perfectly superimposed on the Original Molecule. * **Drawing Style:** When drawing molecules, always use DrawMoleculeACS1996 for consistent and professional visualization: ```python from rdkit.Chem.Draw import rdMolDraw2D drawer = rdMolDraw2D.MolDraw2DSVG(-1, -1) rdMolDraw2D.DrawMoleculeACS1996(drawer, mol) drawer.FinishDrawing() svg = drawer.GetDrawingText() svg = svg.replace("width='", "width='100%' data-original-width='") svg = svg.replace("height='", "height='100%' data-original-height='") ``` * **Reference Implementation:** Use this specific alignment logic to guarantee perfect overlay: ```python matches, unmatched_indices = rdRGroupDecomposition.RGroupDecompose([core_mol], mols, asSmiles=False, asRows=False) ``` ```python def align_substructure_to_parent(sub, parent): if not sub or not parent: return False try: # Strategy 1: Direct match match = parent.GetSubstructMatch(sub) # Strategy 2: Convert dummies to queries (handle R-group attachment points) if not match: params = Chem.AdjustQueryParameters() params.makeDummiesQueries = True params.adjustDegree = False params.adjustRingCount = False sub_query = Chem.AdjustQueryProperties(sub, params) match = parent.GetSubstructMatch(sub_query) # Strategy 3: Try without chirality if not match: match = parent.GetSubstructMatch(sub, useChirality=False) if match: conf_parent = parent.GetConformer() conf_sub = Chem.Conformer(sub.GetNumAtoms()) for sub_idx, parent_idx in enumerate(match): pos = conf_parent.GetAtomPosition(parent_idx) conf_sub.SetAtomPosition(sub_idx, pos) sub.RemoveAllConformers() sub.AddConformer(conf_sub) return True except: pass return False # Usage in loop: # 1. Align Original Molecule to Core template try: AllChem.GenerateDepictionMatching2DStructure(m, core_mol) except: AllChem.Compute2DCoords(m) # 2. Align fragments (Core/R-groups) to Original Molecule # Copy coords FROM original molecule TO fragment if not align_substructure_to_parent(fragment, m): AllChem.Compute2DCoords(fragment) ``` ```python match_core = matches['Core'][i] align_substructure_to_parent(this_core, mol) core_img = mol_to_base64(this_core) ``` 5. **HTML Output (`sar_analysis_report.html`):** * **Design:** Create a clean, modern, and visually appealing HTML page using CSS styling. Use modern CSS features (e.g., subtle shadows, smooth transitions, clean typography, proper color schemes, responsive design) to enhance readability and visual appeal. **Crucially, ensure that the table column widths are large enough to display structures clearly. Set a `min-width` of at least 300px (e.g., `min-width: 300px;`) for the columns containing images (Original, Core, R-groups) so that the molecules are not shrunk and remain easily recognizable.** * **Table Structure:** `Compound Key`, `Activity`, `Original Molecule`, `Core`, and variable R-groups. * **Activity Heatmap:** Apply a background color gradient to Activity cells using a logarithmic scale (Green for low values/high potency, Red for high values/low potency). * **Image Handling:** * Convert molecules to **SVG** (preferred) or Base64 PNG strings. * **Validation:** Check if image generation was successful. Only embed valid images; otherwise, use a text placeholder (`