{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Gaussian processes on discrete data structures\n\nThis example illustrates the use of Gaussian processes for regression and\nclassification tasks on data that are not in fixed-length feature vector form.\nThis is achieved through the use of kernel functions that operates directly\non discrete structures such as variable-length sequences, trees, and graphs.\n\nSpecifically, here the input variables are some gene sequences stored as\nvariable-length strings consisting of letters 'A', 'T', 'C', and 'G',\nwhile the output variables are floating point numbers and True/False labels\nin the regression and classification tasks, respectively.\n\nA kernel between the gene sequences is defined using R-convolution [1]_ by\nintegrating a binary letter-wise kernel over all pairs of letters among a pair\nof strings.\n\nThis example will generate three figures.\n\nIn the first figure, we visualize the value of the kernel, i.e. the similarity\nof the sequences, using a colormap. Brighter color here indicates higher\nsimilarity.\n\nIn the second figure, we show some regression result on a dataset of 6\nsequences. Here we use the 1st, 2nd, 4th, and 5th sequences as the training set\nto make predictions on the 3rd and 6th sequences.\n\nIn the third figure, we demonstrate a classification model by training on 6\nsequences and make predictions on another 5 sequences. The ground truth here is\nsimply whether there is at least one 'A' in the sequence. Here the model makes\nfour correct classifications and fails on one.\n\n.. [1] Haussler, D. (1999). Convolution kernels on discrete structures\n (Vol. 646). Technical report, Department of Computer Science, University\n of California at Santa Cruz.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.base import clone\nfrom sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor\nfrom sklearn.gaussian_process.kernels import GenericKernelMixin, Hyperparameter, Kernel\n\n\nclass SequenceKernel(GenericKernelMixin, Kernel):\n \"\"\"\n A minimal (but valid) convolutional kernel for sequences of variable\n lengths.\"\"\"\n\n def __init__(self, baseline_similarity=0.5, baseline_similarity_bounds=(1e-5, 1)):\n self.baseline_similarity = baseline_similarity\n self.baseline_similarity_bounds = baseline_similarity_bounds\n\n @property\n def hyperparameter_baseline_similarity(self):\n return Hyperparameter(\n \"baseline_similarity\", \"numeric\", self.baseline_similarity_bounds\n )\n\n def _f(self, s1, s2):\n \"\"\"\n kernel value between a pair of sequences\n \"\"\"\n return sum(\n [1.0 if c1 == c2 else self.baseline_similarity for c1 in s1 for c2 in s2]\n )\n\n def _g(self, s1, s2):\n \"\"\"\n kernel derivative between a pair of sequences\n \"\"\"\n return sum([0.0 if c1 == c2 else 1.0 for c1 in s1 for c2 in s2])\n\n def __call__(self, X, Y=None, eval_gradient=False):\n if Y is None:\n Y = X\n\n if eval_gradient:\n return (\n np.array([[self._f(x, y) for y in Y] for x in X]),\n np.array([[[self._g(x, y)] for y in Y] for x in X]),\n )\n else:\n return np.array([[self._f(x, y) for y in Y] for x in X])\n\n def diag(self, X):\n return np.array([self._f(x, x) for x in X])\n\n def is_stationary(self):\n return False\n\n def clone_with_theta(self, theta):\n cloned = clone(self)\n cloned.theta = theta\n return cloned\n\n\nkernel = SequenceKernel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequence similarity matrix under the kernel\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nX = np.array([\"AGCT\", \"AGC\", \"AACT\", \"TAA\", \"AAA\", \"GAACA\"])\n\nK = kernel(X)\nD = kernel.diag(X)\n\nplt.figure(figsize=(8, 5))\nplt.imshow(np.diag(D**-0.5).dot(K).dot(np.diag(D**-0.5)))\nplt.xticks(np.arange(len(X)), X)\nplt.yticks(np.arange(len(X)), X)\nplt.title(\"Sequence similarity under the kernel\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = np.array([\"AGCT\", \"AGC\", \"AACT\", \"TAA\", \"AAA\", \"GAACA\"])\nY = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])\n\ntraining_idx = [0, 1, 3, 4]\ngp = GaussianProcessRegressor(kernel=kernel)\ngp.fit(X[training_idx], Y[training_idx])\n\nplt.figure(figsize=(8, 5))\nplt.bar(np.arange(len(X)), gp.predict(X), color=\"b\", label=\"prediction\")\nplt.bar(training_idx, Y[training_idx], width=0.2, color=\"r\", alpha=1, label=\"training\")\nplt.xticks(np.arange(len(X)), X)\nplt.title(\"Regression on sequences\")\nplt.legend()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train = np.array([\"AGCT\", \"CGA\", \"TAAC\", \"TCG\", \"CTTT\", \"TGCT\"])\n# whether there are 'A's in the sequence\nY_train = np.array([True, True, True, False, False, False])\n\ngp = GaussianProcessClassifier(kernel)\ngp.fit(X_train, Y_train)\n\nX_test = [\"AAA\", \"ATAG\", \"CTC\", \"CT\", \"C\"]\nY_test = [True, True, False, False, False]\n\nplt.figure(figsize=(8, 5))\nplt.scatter(\n np.arange(len(X_train)),\n [1.0 if c else -1.0 for c in Y_train],\n s=100,\n marker=\"o\",\n edgecolor=\"none\",\n facecolor=(1, 0.75, 0),\n label=\"training\",\n)\nplt.scatter(\n len(X_train) + np.arange(len(X_test)),\n [1.0 if c else -1.0 for c in Y_test],\n s=100,\n marker=\"o\",\n edgecolor=\"none\",\n facecolor=\"r\",\n label=\"truth\",\n)\nplt.scatter(\n len(X_train) + np.arange(len(X_test)),\n [1.0 if c else -1.0 for c in gp.predict(X_test)],\n s=100,\n marker=\"x\",\n facecolor=\"b\",\n linewidth=2,\n label=\"prediction\",\n)\nplt.xticks(np.arange(len(X_train) + len(X_test)), np.concatenate((X_train, X_test)))\nplt.yticks([-1, 1], [False, True])\nplt.title(\"Classification on sequences\")\nplt.legend()\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }