{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# IsoTree to TreeLite\n", "\n", "This is a short example of converting an Isolation Forest model generated through the [isotree](https://github.com/david-cortes/isotree) library to [treelite](https://treelite.readthedocs.io/en/latest/index.html) format, which can then be passed to the [tl2cgen](https://tl2cgen.readthedocs.io/en/latest/) library to compile the trees into a standalone runtime library that is often faster at making predictions.\n", "\n", "** *\n", "### Getting some medium-size data from scikit-learn to fit a model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(20640, 8)\n" ] } ], "source": [ "from sklearn.datasets import fetch_california_housing\n", "\n", "X, y = fetch_california_housing(return_X_y=True)\n", "print(X.shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting an isolation forest model through isotree\n", "\n", "*Note: only models that use `ndim=1` can be exported to `treelite` format.*" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from isotree import IsolationForest\n", "\n", "iso = IsolationForest(ndim=1, ntrees=100, sample_size=256,\n", " missing_action=\"impute\", max_depth=8)\n", "iso.fit(X)\n", "\n", "### Now convert\n", "treelite_model = iso.to_treelite()\n", "\n", "### OPTIONAL: add annotations for better branch prediction\n", "import tl2cgen\n", "tl2cgen.annotate_branch(\n", " model=treelite_model,\n", " dmat=tl2cgen.DMatrix(X),\n", " verbose=False,\n", " path=\"iso_branches_annotation.json\"\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Compiling the treelite model\n", "\n", "These models need to be compiled into a shared library in order to be used:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": 
{}, "outputs": [], "source": [ "%%capture\n", "import tl2cgen\n", "import multiprocessing\n", "\n", "tl2cgen.export_lib(\n", " model=treelite_model,\n", " toolchain=\"clang\",\n", " libpath='./predictor.so',\n", " params={\n", " \"parallel_comp\": multiprocessing.cpu_count(),\n", " \"annotate_in\": \"iso_branches_annotation.json\"\n", " }\n", ")\n", "treelite_predictor = tl2cgen.Predictor(\"predictor.so\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now verify that they make the same predictions:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.47006444, 0.47770081, 0.4910637 , 0.42605826, 0.41548625,\n", " 0.41730139, 0.41699421, 0.43228664, 0.40877799, 0.41800632])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iso.predict(X[:10])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.47006445],\n", " [0.47770081],\n", " [0.4910637 ],\n", " [0.42605827],\n", " [0.41548626],\n", " [0.41730139],\n", " [0.41699421],\n", " [0.43228664],\n", " [0.40877799],\n", " [0.41800632]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treelite_predictor.predict(tl2cgen.DMatrix(X[:10]))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "*Note: some small disagreement between the two is expected due to loss of precision when converting. See the documentation in `isotree` for more details.*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing prediction times" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20.4 ms ± 423 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "import joblib\n", "### see docs for 'IsolationForest.predict' about this part\n", "iso.set_params(nthreads=joblib.cpu_count(only_physical_cores=True))\n", "iso.predict(X)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.14 ms ± 620 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit\n", "treelite_predictor.predict(tl2cgen.DMatrix(X))" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }