{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Homepage: https://spkit.github.io\n", "
Nikesh Bajaj : http://nikeshbajaj.in" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Decision Trees without Converting Categorical Features using SpKit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Most ML libraries force us to convert categorical features into one-hot vectors or other numerical values. However, this should not be necessary, **at least not with Decision Trees**, for a simple reason: the way a decision tree works. Each node only tests one feature against one value, and that test can be an equality check as easily as a numerical comparison. In the **spkit library**, the decision tree can handle mixed-type input features, both 'Categorical' and 'Numerical'. In this notebook, I use the dataset *hurricNamed* from the *vincentarelbundock* GitHub repository, and use only a few features, a mix of categorical and numerical ones. Converting the number of deaths to binary with a threshold of 10, we handle this as a classification problem. \n", "\n", "This notebook does not show that converting features into one-hot vectors or label-encoded values affects the performance of the model. Keeping the original categories is nonetheless useful when you need to visualize the decision process, and it is very important when you need to extract and simplify the decision rules." ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "


\n", "
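The idea in the note above can be illustrated with a minimal sketch of a node test that works on mixed feature types. This is not the spkit implementation; the helper names `is_numeric`, `matches` and `split` are hypothetical and exist only for illustration. The sketch assumes that a numerical feature is tested with 'greater than or equal to' and a categorical feature with 'equal to', and it uses four rows taken from the dataset used in this notebook.

```python
def is_numeric(value):
    # bool is excluded so that True/False are treated as categories
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches(row, feature_index, threshold):
    # Numerical feature: take the true branch when value >= threshold.
    # Categorical feature: take the true branch only on an exact match.
    value = row[feature_index]
    if is_numeric(value):
        return value >= threshold
    return value == threshold


def split(rows, feature_index, threshold):
    # Partition the rows into those that pass the node test and those that do not.
    true_rows = [r for r in rows if matches(r, feature_index, threshold)]
    false_rows = [r for r in rows if not matches(r, feature_index, threshold)]
    return true_rows, false_rows


# Columns: Name, AffectedStates, LF.WindsMPH, mf
rows = [
    ['Easy', 'FL', 120, 'f'],
    ['King', 'FL', 130, 'm'],
    ['Able', 'SC', 85, 'm'],
    ['Barbara', 'NC', 85, 'f'],
]

strong, weak = split(rows, 2, 120)         # numerical test: LF.WindsMPH >= 120
florida, elsewhere = split(rows, 1, 'FL')  # categorical test: AffectedStates == 'FL'

print([r[0] for r in strong], [r[0] for r in weak])
print([r[0] for r in florida], [r[0] for r in elsewhere])
```

Because the same `split` function handles both kinds of feature, no one-hot or label encoding step is needed before the rows reach the tree, and the learned rule can be read directly in terms of the original category values.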
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Libraries and Dataset" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "#from sklearn.model_selection import train_test_split\n", "import copy" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [], "source": [ "np.random.seed(100) # just to ensure the reproducible results" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.0.9'" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spkit\n", "spkit.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification - binary class - hurricNamed Dataset" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "from spkit.ml import ClassificationTree" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "D = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/hurricNamed.csv')\n", "feature_names = ['Name', 'AffectedStates','LF.WindsMPH','mf']#,'BaseDamage' 'Year','Year','LF.WindsMPH', 'LF.PressureMB','LF.times',\n", "\n", "\n", "X = np.array(D[feature_names])\n", "X[:,1] = [st.split(',')[0] for st in X[:,1]] #Choosing only first name of state from AffectedStates feature\n", "y = np.array(D[['deaths']])[:,0]\n", "\n", "# Converting target into binary with threshold of 10 deaths\n", "y[y<10] =0\n", "y[y>=10]=1" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([['Easy', 'FL', 120, 'f'],\n", " ['King', 'FL', 130, 'm'],\n", " ['Able', 'SC', 85, 'm'],\n", " ['Barbara', 'NC', 85, 'f'],\n", " ['Florence', 'FL', 85, 'f'],\n", " ['Carol', 'NC', 120, 'f'],\n", " ['Edna', 'MA', 120, 'f'],\n", " ['Hazel', 'SC', 
145, 'f'],\n", " ['Connie', 'NC', 120, 'f'],\n", " ['Diane', 'NC', 85, 'f'],\n", " ['Ione', 'NC', 120, 'm'],\n", " ['Flossy', 'LA', 105, 'f'],\n", " ['Audrey', 'TX', 145, 'f'],\n", " ['Helene', 'NC', 120, 'f'],\n", " ['Debra', 'TX', 85, 'f'],\n", " ['Gracie', 'SC', 120, 'f'],\n", " ['Donna', 'FL', 145, 'f'],\n", " ['Ethel', 'MS', 85, 'f'],\n", " ['Carla', 'TX', 145, 'f'],\n", " ['Cindy', 'TX', 85, 'f'],\n", " ['Cleo', 'FL', 105, 'f'],\n", " ['Dora', 'FL', 105, 'f'],\n", " ['Hilda', 'LA', 120, 'f'],\n", " ['Isbell', 'FL', 105, 'f'],\n", " ['Betsy', 'FL', 120, 'f'],\n", " ['Alma', 'FL', 105, 'f'],\n", " ['Inez', 'FL', 85, 'f'],\n", " ['Beulah', 'TX', 120, 'f'],\n", " ['Gladys', 'FL', 105, 'f'],\n", " ['Camille', 'LA', 190, 'f'],\n", " ['Celia', 'TX', 120, 'f'],\n", " ['Fern', 'TX', 85, 'f'],\n", " ['Edith', 'LA', 105, 'f'],\n", " ['Ginger', 'NC', 85, 'f'],\n", " ['Agnes', 'FL', 85, 'f'],\n", " ['Carmen', 'LA', 120, 'f'],\n", " ['Eloise', 'FL', 120, 'f'],\n", " ['Belle', 'NY', 85, 'f'],\n", " ['Babe', 'LA', 85, 'f'],\n", " ['Bob', 'LA', 85, 'm'],\n", " ['David', 'FL', 105, 'm'],\n", " ['Frederic', 'AL', 120, 'm'],\n", " ['Allen', 'TX', 115, 'm'],\n", " ['Alicia', 'TX', 115, 'f'],\n", " ['Diana', 'NC', 110, 'f'],\n", " ['Bob', 'SC', 75, 'm'],\n", " ['Danny', 'LA', 90, 'm'],\n", " ['Elena', 'MS', 115, 'f'],\n", " ['Gloria', 'NC', 120, 'f'],\n", " ['Juan', 'LA', 85, 'm'],\n", " ['Kate', 'FL', 100, 'f'],\n", " ['Bonnie', 'TX', 85, 'f'],\n", " ['Charley', 'NC', 75, 'm'],\n", " ['Floyd', 'FL', 75, 'm'],\n", " ['Florence', 'LA', 80, 'f'],\n", " ['Chantal', 'TX', 80, 'f'],\n", " ['Hugo', 'SC', 140, 'm'],\n", " ['Jerry', 'TX', 85, 'm'],\n", " ['Bob', 'RI', 105, 'm'],\n", " ['Andrew', 'FL', 170, 'm'],\n", " ['Emily', 'NC', 115, 'f'],\n", " ['Erin', 'FL', 100, 'f'],\n", " ['Opal', 'FL', 115, 'f'],\n", " ['Bertha', 'NC', 105, 'f'],\n", " ['Fran', 'NC', 115, 'f'],\n", " ['Danny', 'LA', 80, 'm'],\n", " ['Bonnie', 'NC', 110, 'f'],\n", " ['Earl', 'FL', 80, 'm'],\n", " ['Georges', 
'FL', 105, 'm'],\n", " ['Bret', 'TX', 115, 'm'],\n", " ['Floyd', 'NC', 105, 'm'],\n", " ['Irene', 'FL', 80, 'f'],\n", " ['Lili', 'LA', 90, 'f'],\n", " ['Claudette', 'TX', 90, 'f'],\n", " ['Isabel', 'NC', 105, 'f'],\n", " ['Alex', 'NC', 80, 'm'],\n", " ['Charley', 'FL', 150, 'm'],\n", " ['Gaston', 'SC', 75, 'm'],\n", " ['Frances', 'FL', 105, 'f'],\n", " ['Ivan', 'AL', 120, 'm'],\n", " ['Jeanne', 'FL', 120, 'f'],\n", " ['Cindy', 'LA', 75, 'f'],\n", " ['Dennis', 'FL', 120, 'm'],\n", " ['Katrina', 'LA', 125, 'f'],\n", " ['Ophelia', 'NC', 75, 'f'],\n", " ['Rita', 'LA', 115, 'f'],\n", " ['Wilma', 'FL', 120, 'f'],\n", " ['Humberto', 'TX', 90, 'm'],\n", " ['Dolly', 'TX', 85, 'f'],\n", " ['Gustav', 'LA', 105, 'm'],\n", " ['Ike', 'TX', 110, 'm'],\n", " ['Irene', 'NC', 75, 'f'],\n", " ['Isaac', 'LA', 80, 'm'],\n", " ['Sandy', 'NY', 75, 'f']], dtype=object),\n", " array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,\n", " 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,\n", " 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,\n", " 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,\n", " 0, 1, 1, 1, 0, 1], dtype=int64))" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X,y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No separate training and testing split is used here, as the objective is only to show that the tree works with mixed feature types." ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "clf = ClassificationTree(max_depth=4)\n", "clf.fit(X,y,feature_names=feature_names)\n", "yp = clf.predict(X)" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8,5))\n", "clf.plotTree()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, it can be seen that first feature 'LF.WindsMPH' is a numerical, thus the threshold is greater then equal to, however for catogorical features like 'Name' and 'AffectedStates'threshold is equal to only." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "py36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "273.188px" }, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 
4, "nbformat_minor": 2 }