{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Isolation Forests as kernels for SVM\n",
    "\n",
    "This is a short example about using the [isotree](https://www.github.com/david-cortes/isotree) library for fitting an isolation forest model and using this (unsupervised) fitted model for calculating the similarities between each pair of observations, which in turn can be used as a kernel for SVM (support vector machines) for supervised learning tasks.\n",
    "\n",
    "By default, the library calculates a distance metric between observations which is bounded between zero and one. Having these bounds, it can be easily turned into a similarity metric by simply calculating one minus this distance. This similarity metric satisfies the properties of a Hilbert space (https://en.wikipedia.org/wiki/Hilbert_space), being possible to use it as a kernel for support vector machines or as a feature generator.\n",
    "\n",
    "The library includes a function `set_reference_points` which can be used for repeated distance calculations against the same points. Note that this is however a typically very slow and memory-heavy operation, and as such is not recommended for large datasets.\n",
    "\n",
    "The example uses the \"splice scale\" dataset, which can be downloaded from [LibSVM dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Note that, despite being offered in a sparse matrix format, the data is actually dense.\n",
    "\n",
    "** *\n",
    "### Reading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1000, 60)\n"
     ]
    }
   ],
   "source": [
    "from readsparse import read_sparse\n",
    "\n",
    "splice = read_sparse(\"splice_scale.txt\")\n",
    "y = (splice[\"y\"] == 1).astype(\"float64\")\n",
    "X = splice[\"X\"].toarray()\n",
    "print(X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a kernel transformer from isolation forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from isotree import IsolationForest\n",
    "from sklearn.base import TransformerMixin, BaseEstimator\n",
    "\n",
    "class IsoDistKernel(TransformerMixin, BaseEstimator):\n",
    "    \n",
    "    def __init__(self, isotree_params: dict = {}):\n",
    "        self.isotree_params = isotree_params\n",
    "    \n",
    "    def fit(self, X, y=None, sample_weights=None):\n",
    "        self.iso_ = IsolationForest(**self.isotree_params)\n",
    "        self.iso_.fit(X).set_reference_points(X, with_distances=True)\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X):\n",
    "        D = self.iso_.predict_distance(X, use_reference_points=True)\n",
    "        return 1 - D"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating results with this kernel\n",
    "\n",
    "Note that while typically most SVM libraries manually calculate a set of predefined kernels, software such as LibSVM (and by extension scikit-learn which uses it behind the hood) or ThunderSVM allow passing a precomputed kernel as data instead of the original points, which is a square matrix with dimension equal to the number of observations.\n",
    "\n",
    "The results here are evaluated by a randomized and stratified 5-fold cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cross-validation results (distance isolation kernel):\n",
      "{'fit_time': array([0.11507702, 0.10557008, 0.11134267, 0.1177721 , 0.0980866 ]),\n",
      " 'score_time': array([0.02970052, 0.0232563 , 0.02420211, 0.02198601, 0.03312016]),\n",
      " 'test_score': array([0.96694712, 0.97225561, 0.96546892, 0.98168352, 0.96957262])}\n",
      "----\n",
      "Mean CV AUROC: 0.9712\n"
     ]
    }
   ],
   "source": [
    "from sklearn.svm import SVC\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.model_selection import cross_validate\n",
    "from pprint import pprint\n",
    "\n",
    "model = make_pipeline(\n",
    "    IsoDistKernel({\n",
    "        \"ndim\":1,\n",
    "        \"sample_size\":256,\n",
    "        \"ntrees\":250,\n",
    "        \"missing_action\":\"fail\"\n",
    "    }),\n",
    "    SVC(kernel=\"precomputed\")\n",
    ")\n",
    "cv_res_iso = cross_validate(\n",
    "    estimator=model,\n",
    "    X=X, y=y,\n",
    "    scoring=\"roc_auc\",\n",
    "    n_jobs=1,\n",
    "    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)\n",
    ")\n",
    "print(\"Cross-validation results (distance isolation kernel):\")\n",
    "pprint(cv_res_iso)\n",
    "print(\"----\")\n",
    "print(\"Mean CV AUROC: %.4f\" % cv_res_iso[\"test_score\"].mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A natural question is how good was the addition of this kernel compared to something simpler. As will be seen, results are better with the isolation kernel than with the default Gaussian RBF kernel used by this library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cross-validation results (default RBF kernel):\n",
      "{'fit_time': array([0.03547263, 0.0420773 , 0.04024625, 0.03484583, 0.03834677]),\n",
      " 'score_time': array([0.01111746, 0.01246071, 0.01298881, 0.0111289 , 0.01078558]),\n",
      " 'test_score': array([0.95633013, 0.94991987, 0.92773496, 0.96226604, 0.91942748])}\n",
      "----\n",
      "Mean CV AUROC: 0.9431\n"
     ]
    }
   ],
   "source": [
    "### Compare against a simpler kernel\n",
    "cv_res_plain_kernel = cross_validate(\n",
    "    estimator=SVC(),\n",
    "    X=X, y=y,\n",
    "    scoring=\"roc_auc\",\n",
    "    n_jobs=-1,\n",
    "    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)\n",
    ")\n",
    "print(\"Cross-validation results (default RBF kernel):\")\n",
    "pprint(cv_res_plain_kernel)\n",
    "print(\"----\")\n",
    "print(\"Mean CV AUROC: %.4f\" % cv_res_plain_kernel[\"test_score\"].mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More efficient calculation for fitted model\n",
    "\n",
    "While the input to the model is a kernel matrix with number of columns corresponding to the number of observations to which it was fitted, in practice SVM models only end up using a fraction of the total observations in their prediction formula (these are the so-called \"support vectors\").\n",
    "\n",
    "As such, once one has a fitted model and wants to make predictions on new data, it is not necessary (nor beneficial) to calculate distances from the new observations to every single point that was in the training data - only distances with respect to support vectors are needed.\n",
    "\n",
    "The software used here (scikit-learn) unfortunately does not have any option for automatically telling the model methods to \"shrink\" their input requirements, but it is nevertheless easy to re-create the formula manually:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction from automated call to 'decision_function':\n",
      "[ 1.84911618  1.247814    1.47801342 -0.41053304 -0.94761292  0.44522017\n",
      "  0.31118329  1.50615986 -0.13875399  0.72681926]\n"
     ]
    }
   ],
   "source": [
    "iso = IsolationForest().fit(X).set_reference_points(X, with_distances=True)\n",
    "K = 1 - iso.predict_distance(X, use_reference_points=True)\n",
    "svm = SVC(kernel=\"precomputed\").fit(K, y)\n",
    "p_auto = svm.decision_function(K[:10]).reshape(-1)\n",
    "print(\"Prediction from automated call to 'decision_function':\")\n",
    "print(p_auto)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reference points picked: 559\n",
      "Prediction from manual formula using only selected reference points:\n",
      "[ 1.84911618  1.247814    1.47801342 -0.41053304 -0.94761292  0.44522017\n",
      "  0.31118329  1.50615986 -0.13875399  0.72681926]\n"
     ]
    }
   ],
   "source": [
    "idx_used = svm.support_\n",
    "print(\"Number of reference points picked: %d\" % idx_used.shape[0])\n",
    "iso.set_reference_points(X[idx_used], with_distances=True)\n",
    "K_used = 1. - iso.predict_distance(X[:10], use_reference_points=True)\n",
    "p_manual = K_used.dot(svm.dual_coef_.reshape(-1)) + svm.intercept_[0]\n",
    "print(\"Prediction from manual formula using only selected reference points:\")\n",
    "print(p_manual)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sub-sampled kernel\n",
    "\n",
    "While SVM models typically involve efficient optimization routines for square kernel matrices which end up identifying the best reference points (support vectors) to use in the final prediction formula, it is also possible to use the trick with a plain generalized linear model such as logistic regression by instead supplying features that are the kernels with respect to randomly-sampled points within the data.\n",
    "\n",
    "This is faster to calculate as a kernel, but typically the results are not as good quality as when using a full square matrix, since the support vectors are randomly-chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cross-validation results (randomly sub-sampled distance isolation kernel):\n",
      "{'fit_time': array([0.06880903, 0.10956693, 0.10929084, 0.12208724, 0.17555356]),\n",
      " 'score_time': array([0.01918626, 0.0191586 , 0.01897311, 0.02019763, 0.03826642]),\n",
      " 'test_score': array([0.96604567, 0.92938702, 0.94334901, 0.96416775, 0.93063757])}\n",
      "----\n",
      "Mean CV AUROC: 0.9467\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "import numpy as np\n",
    "\n",
    "class IsoSubSampledDistKernel(TransformerMixin, BaseEstimator):\n",
    "    \n",
    "    def __init__(self, isotree_params: dict = {}, n_samples=250, random_state=None):\n",
    "        self.isotree_params = isotree_params\n",
    "        self.n_samples = n_samples\n",
    "        self.random_state = random_state\n",
    "    \n",
    "    def fit(self, X, y=None, sample_weights=None):\n",
    "        self.iso_ = IsolationForest(**self.isotree_params)\n",
    "        self.iso_.fit(X)\n",
    "        rng = np.random.default_rng(seed=self.random_state)\n",
    "        idx_random = rng.choice(X.shape[0], size=self.n_samples)\n",
    "        self.iso_.set_reference_points(X[idx_random], with_distances=True)\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X):\n",
    "        D = self.iso_.predict_distance(X, use_reference_points=True)\n",
    "        return 1 - D\n",
    "\n",
    "model_subsampled = make_pipeline(\n",
    "    IsoSubSampledDistKernel({\n",
    "        \"ndim\":1,\n",
    "        \"sample_size\":256,\n",
    "        \"ntrees\":250,\n",
    "        \"missing_action\":\"fail\"\n",
    "    },\n",
    "    n_samples = 250, random_state=456),\n",
    "    LogisticRegression(solver=\"lbfgs\", max_iter=10000)\n",
    ")\n",
    "cv_res_iso_subsampled = cross_validate(\n",
    "    estimator=model_subsampled,\n",
    "    X=X, y=y,\n",
    "    scoring=\"roc_auc\",\n",
    "    n_jobs=1,\n",
    "    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1)\n",
    ")\n",
    "print(\"Cross-validation results (randomly sub-sampled distance isolation kernel):\")\n",
    "pprint(cv_res_iso_subsampled)\n",
    "print(\"----\")\n",
    "print(\"Mean CV AUROC: %.4f\" % cv_res_iso_subsampled[\"test_score\"].mean())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (OpenBLAS)",
   "language": "python",
   "name": "py3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}