{ "cells": [ { "cell_type": "markdown", "id": "59a2da02", "metadata": {}, "source": [ "Pat Walters, the author of the cheminformatics blog [Practical Cheminformatics](https://practicalcheminformatics.blogspot.com), maintains a [repository](https://github.com/PatWalters/datafiles) of datasets used in his blog posts. Because several of these datasets are derived from ChEMBL, it is worthwhile to build reproducible workflows that regenerate them using `chembl-downloader`.\n", "\n", "In this notebook, we'll look at a dataset of small molecule inhibitors of [5-lipoxygenase activating protein (CHEMBL4550)](https://bioregistry.io/chembl:CHEMBL4550). It's available as a `*.smi` file, which is a CSV file with SMILES strings in the first column and arbitrary, application-specific content in the remaining columns. In this case, the remaining two columns are a ChEMBL compound identifier and a [pChEMBL](https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-pchembl) value. The data for this example can be found at [https://github.com/PatWalters/datafiles/raw/main/CHEMBL4550.smi](https://github.com/PatWalters/datafiles/raw/main/CHEMBL4550.smi)." 
] }, { "cell_type": "code", "execution_count": 1, "id": "05258e99", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import time\n", "from collections import defaultdict\n", "\n", "import matplotlib.pyplot as plt\n", "import matplotlib_inline\n", "import pandas as pd\n", "import seaborn as sns\n", "from scipy import stats\n", "from rdkit import Chem\n", "from tqdm.auto import tqdm\n", "from sklearn.decomposition import PCA\n", "\n", "import chembl_downloader\n", "import chembl_downloader.contrib" ] }, { "cell_type": "code", "execution_count": 2, "id": "944902b1", "metadata": {}, "outputs": [], "source": [ "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "5d66ffc4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)]\n" ] } ], "source": [ "print(sys.version)" ] }, { "cell_type": "code", "execution_count": 4, "id": "c16042f3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sat Oct 29 17:49:20 2022\n" ] } ], "source": [ "print(time.asctime())" ] }, { "cell_type": "markdown", "id": "670fc2b7", "metadata": {}, "source": [ "## Loading Walters' File" ] }, { "cell_type": "code", "execution_count": 5, "id": "c8e0decf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | canonical_smiles | molecule_chembl_id | pchembl_value |
|---|---|---|---|
| 330 | Nc1cnc(-c2ccc(C3CCC3)c(Oc3ncccn3)c2F)cn1 | CHEMBL3586209 | 7.98 |
| 486 | Nc1cnc(-c2ccc(C3CCC3)c(OCc3nc(C(=O)O)co3)c2F)cn1 | CHEMBL3639581 | 7.53 |
| 1268 | CCC(CC)(Cc1nc2ccc(OCc3ccn(C)n3)cc2n1Cc1ccc(OC(... | CHEMBL3639611 | 7.35 |
| 1315 | Cn1ccc(COc2ccc3nc([C@@H]4CCCC[C@@H]4C(=O)O)n(C... | CHEMBL3639658 | 7.52 |
| 15 | O=C(O)[C@@H]1CCCC[C@H]1c1nc2cc(OCc3ccc4ccccc4n... | CHEMBL3639771 | 7.92 |
| ... | ... | ... | ... |
| 422 | CC(C)(C)c1ccc(-c2cnc(N)nc2)c(F)c1Oc1cc(N)ncn1 | CHEMBL3704364 | 7.33 |
| 423 | CC(C)(C)c1ccc(-c2cnc(N)nc2)c(F)c1Oc1nccc(N)n1 | CHEMBL3704365 | 7.63 |
| 424 | COc1c(Br)ccc(-c2cnc(N)cn2)c1F | CHEMBL3704366 | 5.60 |
| 425 | COc1c(C2CCCC2)ccc(-c2cnc(N)cn2)c1F | CHEMBL3704367 | 6.28 |
| 426 | Nc1cnc(-c2ccc(C3CCCC3)c(OCc3ccccc3)c2F)cn1 | CHEMBL3704368 | 7.19 |

1407 rows × 3 columns

|  | assay_chembl_id | canonical_smiles | molecule_chembl_id | pchembl_value |
|---|---|---|---|---|
| 3 | CHEMBL958743 | CC(=O)Nc1ccc(O)cc1 | CHEMBL112 | 4.36 |
| 186 | CHEMBL1211525 | COc1cnc(-c2ccc(Cn3c(CC(C)(C)C(=O)[O-])c(SC(C)(... | CHEMBL1210423 | 8.70 |
| 190 | CHEMBL1924524 | COc1cnc(-c2ccc(Cn3c(CC(C)(C)C(=O)O)c(SC(C)(C)C... | CHEMBL1229205 | 9.22 |
| 189 | CHEMBL1924522 | COc1cnc(-c2ccc(Cn3c(CC(C)(C)C(=O)O)c(SC(C)(C)C... | CHEMBL1229205 | 7.09 |
| 188 | CHEMBL1924521 | COc1cnc(-c2ccc(Cn3c(CC(C)(C)C(=O)O)c(SC(C)(C)C... | CHEMBL1229205 | 6.79 |
| ... | ... | ... | ... | ... |
| 2 | CHEMBL3783530 | CC(c1cc2ccccc2s1)N(O)C(N)=O | CHEMBL93 | 6.24 |
| 1 | CHEMBL1924522 | CC(c1cc2ccccc2s1)N(O)C(N)=O | CHEMBL93 | 5.38 |
| 0 | CHEMBL1924521 | CC(c1cc2ccccc2s1)N(O)C(N)=O | CHEMBL93 | 5.66 |
| 80 | CHEMBL618418 | CC1Cc2c(OCc3ccccn3)ccc3c2c(c(CC(C)(C)C(=O)O)n3... | CHEMBL96412 | 6.96 |
| 81 | CHEMBL618418 | COc1ccc(COc2ccc3c4c2CC(C)Sc4c(CC(C)(C)C(=O)O)n... | CHEMBL96611 | 7.40 |

2663 rows × 4 columns

|  | canonical_smiles | molecule_chembl_id | pchembl_value | new_mean | new_gmean |
|---|---|---|---|---|---|
| 1268 | CCC(CC)(Cc1nc2ccc(OCc3ccn(C)n3)cc2n1Cc1ccc(OC(... | CHEMBL3639611 | 7.350 | 7.350 | 7.350 |
| 475 | COC(=O)c1ccc2oc(COc3c(C4CCC4)ccc(-c4cnc(N)cn4)... | CHEMBL3659294 | 7.710 | 7.710 | 7.710 |
| 476 | COC(=O)c1ccc(O)c(NC(=O)COc2c(C3CCC3)ccc(-c3cnc... | CHEMBL3659295 | 6.860 | 6.860 | 6.860 |
| 1233 | CCC(CC)(Cc1nc2ccc(OCc3ccc(C)cn3)cc2n1Cc1ccc(Br... | CHEMBL3662241 | 8.000 | 8.000 | 8.000 |
| 1234 | Cc1ccc(COc2ccc3nc(CC4(C(=O)O)CCCC4)n(Cc4ccc(Br... | CHEMBL3662242 | 7.740 | 7.750 | 7.750 |
| ... | ... | ... | ... | ... | ... |
| 605 | Nc1cnc(-c2ccc(-c3ccccc3S(=O)(=O)N3CC[C@H](N)C3... | CHEMBL3688481 | 6.695 | 6.695 | 6.692 |
| 722 | Nc1cnc(-c2ccc(-c3ccccc3CSc3nccc(N)n3)cc2F)cn1 | CHEMBL3688593 | 8.230 | 8.230 | 8.229 |
| 739 | Nc1ncc(-c2ccc(-c3ccccc3S(=O)(=O)C3CC3)cc2F)cn1 | CHEMBL3693003 | 7.100 | 7.020 | 7.020 |
| 762 | C[C@H](O)CNS(=O)(=O)c1ccccc1-c1ccc(-c2cnc(N)nc... | CHEMBL3693026 | 7.295 | 7.300 | 7.299 |
| 971 | Nc1cnc(-c2ccc(-c3ccccc3C(=O)N3CCS(=O)(=O)CC3)c... | CHEMBL3697193 | 6.830 | 6.830 | 6.830 |

114 rows × 5 columns

|  | canonical_smiles | molecule_chembl_id | pchembl_value |
|---|---|---|---|
| 0 | Br.O=C(Nc1cccnc1)Oc1ccc(OCc2nc3ccccc3s2)cc1C1(... | CHEMBL541915 | 8.620 |
| 1 | C.Nc1cnc(-c2ccc(-c3ccccc3S(=O)(=O)N3C[C@H]4C[C... | CHEMBL4110733 | 6.890 |
| 2 | C1=CC(COc2ccc(OCc3ccc4ccccc4n3)cc2C2(c3ccccc3)... | CHEMBL255227 | 8.600 |
| 3 | C=C(Cc1ccc(-c2cnc(N)nc2)cc1)c1nc2cc(C#N)ccc2n1... | CHEMBL4250880 | 6.430 |
| 4 | C=CCOc1ccc2c(c1)nc(C(C)c1ccc(CC(C)C)cc1)n2Cc1c... | CHEMBL3927809 | 6.770 |
| ... | ... | ... | ... |
| 2153 | c1ccc(COc2ccc(OCc3ccc4ccccc4n3)cc2C2(c3ccccc3)... | CHEMBL430038 | 8.070 |
| 2154 | c1ccc([C@@]2(c3cc(OCc4ccc5ccccc5n4)ccc3-c3ncon... | CHEMBL2031650 | 7.205 |
| 2155 | c1ccc([C@@]2(c3cc(OCc4ccc5ccccc5n4)ccc3-c3nn[n... | CHEMBL2031652 | 6.890 |
| 2156 | c1ccc([C@@]2(c3cc(OCc4ccc5ccccc5n4)ccc3-c3nnco... | CHEMBL2031649 | 7.480 |
| 2157 | c1ccc2nc(COc3ccc(C4(c5ccc(OCc6ccc7ccccc7n6)cc5... | CHEMBL257797 | 6.900 |

2158 rows × 3 columns
\n", "