{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Thresholds for \"random\" in fingerprints the RDKit supports\n", "\n", "This is an updated version of a post. The original version of the notebook can be [found on GitHub](https://github.com/greglandrum/rdkit_blog/blob/936b35b77ddca3e843991d1c4b1e3532e754b734/notebooks/Fingerprint%20Thresholds.ipynb).\n", "\n", "A frequent question that comes up when considering fingerprint similarity is: \"What threshold should I use to determine what a neighbor is?\" The answer is poorly defined. Of course it depends heavily on the details of the fingerprint, but there's also a very subjective component: you want to pick a low enough threshold that you're sure you won't miss anything, but you don't want to pick up too much noise.\n", "\n", "The goal here is to systematically come up with some guidelines that can be used for fingerprints supported within the RDKit. We will do that by looking at similarities between pairs of random \"drug-like\" (MW<600) molecules picked from ChEMBL.\n", "\n", "For the analysis, the 25K similarity values (one per pair) are sorted and the values at particular percentile levels are examined.\n", "\n", "There's a fair amount of code and results below, so here's the summary table. To help interpret this: 22500 of the 25000 pairs (90%) have a MACCS keys similarity value less than 0.528.\n", "\n", "\n", "
Fingerprint | Metric | 70% level | 80% level | 90% level | 95% level | 99% level |
---|---|---|---|---|---|---|
MACCS | Tanimoto | 0.431 | 0.471 | 0.528 | 0.575 | 0.655 |
Morgan0 (counts) | Tanimoto | 0.429 | 0.471 | 0.525 | 0.568 | 0.651 |
Morgan1 (counts) | Tanimoto | 0.265 | 0.293 | 0.333 | 0.364 | 0.429 |
Morgan2 (counts) | Tanimoto | 0.181 | 0.201 | 0.229 | 0.252 | 0.305 |
Morgan3 (counts) | Tanimoto | 0.141 | 0.156 | 0.178 | 0.196 | 0.238 |
Morgan0 (bits) | Tanimoto | 0.435 | 0.475 | 0.529 | 0.571 | 0.656 |
Morgan1 (bits) | Tanimoto | 0.273 | 0.301 | 0.341 | 0.371 | 0.434 |
Morgan2 (bits) | Tanimoto | 0.197 | 0.217 | 0.246 | 0.269 | 0.322 |
Morgan3 (bits) | Tanimoto | 0.165 | 0.181 | 0.203 | 0.222 | 0.264 |
FeatMorgan0 (counts) | Tanimoto | 0.583 | 0.630 | 0.690 | 0.737 | 0.818 |
FeatMorgan1 (counts) | Tanimoto | 0.390 | 0.425 | 0.474 | 0.511 | 0.581 |
FeatMorgan2 (counts) | Tanimoto | 0.272 | 0.298 | 0.333 | 0.364 | 0.424 |
FeatMorgan3 (counts) | Tanimoto | 0.209 | 0.228 | 0.256 | 0.279 | 0.328 |
FeatMorgan0 (bits) | Tanimoto | 0.583 | 0.630 | 0.690 | 0.737 | 0.818 |
FeatMorgan1 (bits) | Tanimoto | 0.395 | 0.429 | 0.477 | 0.514 | 0.585 |
FeatMorgan2 (bits) | Tanimoto | 0.284 | 0.310 | 0.347 | 0.376 | 0.434 |
FeatMorgan3 (bits) | Tanimoto | 0.228 | 0.248 | 0.276 | 0.299 | 0.349 |
RDKit 4 (bits) | Tanimoto | 0.209 | 0.239 | 0.285 | 0.325 | 0.426 |
RDKit 5 (bits) | Tanimoto | 0.197 | 0.219 | 0.253 | 0.287 | 0.368 |
RDKit 6 (bits) | Tanimoto | 0.230 | 0.250 | 0.280 | 0.308 | 0.369 |
RDKit 7 (bits) | Tanimoto | 0.313 | 0.346 | 0.389 | 0.429 | 0.507 |
linear RDKit 4 (bits) | Tanimoto | 0.225 | 0.258 | 0.309 | 0.354 | 0.462 |
linear RDKit 5 (bits) | Tanimoto | 0.198 | 0.225 | 0.269 | 0.309 | 0.404 |
linear RDKit 6 (bits) | Tanimoto | 0.187 | 0.210 | 0.246 | 0.282 | 0.365 |
linear RDKit 7 (bits) | Tanimoto | 0.182 | 0.203 | 0.234 | 0.264 | 0.337 |
Atom Pairs (counts) | Tanimoto | 0.180 | 0.204 | 0.237 | 0.265 | 0.325 |
Torsions (counts) | Tanimoto | 0.107 | 0.130 | 0.165 | 0.194 | 0.266 |
Atom Pairs (bits) | Tanimoto | 0.275 | 0.301 | 0.335 | 0.363 | 0.415 |
Torsions (bits) | Tanimoto | 0.133 | 0.155 | 0.188 | 0.219 | 0.288 |
Avalon 512 (bits) | Tanimoto | 0.369 | 0.407 | 0.461 | 0.505 | 0.575 |
Avalon 1024 (bits) | Tanimoto | 0.269 | 0.297 | 0.340 | 0.375 | 0.449 |
Avalon 512 (counts) | Tanimoto | 0.300 | 0.333 | 0.379 | 0.418 | 0.491 |
Avalon 1024 (counts) | Tanimoto | 0.267 | 0.299 | 0.344 | 0.384 | 0.462 |
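
The way each row of the table is read off can be illustrated with a small, dependency-free sketch: compute a Tanimoto similarity for each pair, sort the values, and take the value at each level. This is only an illustration under stated assumptions, not the code used to produce the numbers above. The random bit sets are hypothetical stand-ins for real RDKit fingerprints, the helper names (`tanimoto`, `threshold_levels`, `random_bitset`) are made up for this example, and the indexing rule `int(level * n)` is one reasonable choice rather than a confirmed detail of the original analysis.

```python
import random


def tanimoto(a, b):
    # Tanimoto similarity between two sets of on-bit indices.
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def threshold_levels(sims, levels=(0.7, 0.8, 0.9, 0.95, 0.99)):
    # Sort the similarity values and read off the value at each level.
    # Assumption: the value at index int(level * n) is reported.
    ordered = sorted(sims)
    n = len(ordered)
    return {level: ordered[min(int(level * n), n - 1)] for level in levels}


def random_bitset(rng, n_bits=2048, n_on=40):
    # Hypothetical stand-in for a real fingerprint: n_on random on-bits.
    return frozenset(rng.sample(range(n_bits), n_on))


# Synthetic data only: 2000 random pairs instead of 25000 ChEMBL pairs.
rng = random.Random(42)
pairs = [(random_bitset(rng), random_bitset(rng)) for _ in range(2000)]
sims = [tanimoto(a, b) for a, b in pairs]

levels = threshold_levels(sims)
for level, value in levels.items():
    print(f'{level:.0%} level: {value:.3f}')
```

With real molecules, the random bit sets would be replaced by fingerprints generated with the RDKit and the similarity by the matching RDKit similarity function. The sorting and lookup step stays the same.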