{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning with ibmdbpy and Db2 Warehouse:  Naïve Bayes Classification\n",
    "\n",
    "This notebook explains how to perform Naive Bayes classification with ibmdbpy Python package and data stored in Db2 Warehouse.\n",
    "___\n",
    "\n",
    "### Accelerate Python analytics with in-database processing by using ibmdbpy and IBM Db2 Warehouse\n",
    " \n",
    "The ibmdbpy project provides a Python interface for data manipulation and access to in-database algorithms in IBM Db2 Warehouse. It accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution, thereby benefitting from in-database performance-enhancing features, such as columnar storage and parallel processing. For more details about ibmdbpy, please refer to the [documentation](https://pythonhosted.org/ibmdbpy/index.html) and to the dedicated [Git repository](https://github.com/ibmdbanalytics/ibmdbpy/tree/master/ibmdbpy). This notebook provides you with an overview of ibmdbpy functionalities. \n",
    "\n",
    " \n",
    "__Prerequisites__\n",
    "* Db2 account: see [IBM Cloud](https://cloud.ibm.com/login) or [Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.kc.doc/welcome.html)\n",
    "* Db2 driver: learn more on [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html) and see [IBM Support](https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads)\n",
    "* Having installed the [ibmdbpy](https://pypi.org/project/ibmdbpy/) python library with pip: \n",
    "> pip install ibmdbpy \n",
    "* Optional dependency for JDBC is the [jaydebeapi](https://pypi.org/project/JayDeBeApi/) library. Run the following command to install ibmdbpy, as well as the dependencies for the JDBC feature:\n",
    "> pip install ibmdbpy[jdbc]\n",
    "\n",
    "__About naive Bayes classification__\n",
    "\n",
    "The Naive Bayes algorithm is a type of probabilistic ML classification algorithm. It is based on the Bayes theorem and on the assumption that predictors are independent. In practice, this is not really the case but naive Bayes classifiers perform quite well on tasks such as sentiment analysis, spam filtering and product recommendation.\n",
    "More on the [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.analytics.doc/doc/r_naive_bayes.html).\n",
    "\n",
    "__Contents__\n",
    "\n",
    "This notebook shows you how to perform naive Bayes classification with ibmdbpy on the iris data sample.\n",
    "\n",
    "__1. Get started__\n",
    "* Imports\n",
    "* Establish a JDBC connection to Db2\n",
    "* Load sample data\n",
    "\n",
    "__2. Perform Naive Bayes Classification with ibmdbpy__\n",
    "* Define an indexer\n",
    "* Build the model\n",
    "* Analyze the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Get started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Imports__\n",
    "\n",
    "As usual we import some useful modules from ibmdbpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ibmdbpy import IdaDataBase, IdaDataFrame\n",
    "from ibmdbpy import learn\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Establish JDBC connection to Db2 Warehouse__\n",
    "\n",
    "Please enter your credentials in order to connect to the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Enter the values for you database connection\n",
    "dsn_database = \"___\" # e.g. \"BLUDB\"\n",
    "dsn_hostname = \"___\" # e.g.: \"abc.url.example\"\n",
    "dsn_port = \"___\"    # e.g. \"50000\"\n",
    "dsn_uid = \"___\"     # e.g. \"db2_1234\"\n",
    "dsn_pwd = \"___\"     # e.g. \"zorglub\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Establish connection\n",
    "connection_string='jdbc:db2://'+dsn_hostname+':'+dsn_port+'/'+dsn_database+':user='+dsn_uid+';password='+dsn_pwd+\";\" \n",
    "idadb=IdaDataBase(dsn=connection_string, verbose = False)\n",
    "# set verbose to True if you want to see the detail of ibmdbpy operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default the environment variable `AUTOCOMMIT` is then set to True, which means that every SQL statement which is submitted through the connection is executed within its own transaction and then committed implicitly. When you close the connection to Db2, if the environment variable `AUTOCOMMIT` is set to False, then all changes after the last explicit commit are discarded. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Load sample data__\n",
    "\n",
    "In this notebook, we simply use the well-known iris dataset, which is available out of the box in ibmdbpy. The dataset is first loaded as a Pandas DataFrame and then written into Db2 as a new table called IRIS. We define an IdaDataFrame with name idadf as a reference to this physical table. Finally, we add an ID column to index the rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading 150 rows (maxnrow was set to 20000)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4.3</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.1</td>\n",
       "      <td>0.1</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.4</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.4</td>\n",
       "      <td>2.9</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.4</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.5</td>\n",
       "      <td>2.3</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.3</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           4.3          3.0           1.1          0.1  setosa\n",
       "1           4.4          3.0           1.3          0.2  setosa\n",
       "2           4.4          2.9           1.4          0.2  setosa\n",
       "3           4.4          3.2           1.3          0.2  setosa\n",
       "4           4.5          2.3           1.3          0.3  setosa"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetch sample data\n",
    "from ibmdbpy.sampledata import iris\n",
    "\n",
    "# Convert the pandas DataFrame into an IdaDataFrame\n",
    "idadf = idadb.as_idadataframe(iris, \"IRIS\", clear_existing=True)\n",
    "\n",
    "idadf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Perform Naive Bayes Classification with ibmdbpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Define an indexer__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need an indexer to be able to fit the model on the dataset! We have defined the indexer with `add_column_id` in the previous cell. Alternatively, if an eligible column already exists, you can set it as indexer as follows:\n",
    "\n",
    "> idadf = idadb.as_idadataframe(df, indexer = \"col_name\")\n",
    "\n",
    "or, if the table already exists, define the IdaDataFrame by directly pointing at this table:\n",
    "\n",
    "> idadf = IdaDataFrame(tablename, indexer = \"col_name\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add an indexer column\n",
    "# destructive = True ensures that the underlying table is modified too\n",
    "# Otherwise, an ID column is only added to the view.\n",
    "# Of course, physically modifying the table takes more time.\n",
    "idadb.add_column_id(idadf, destructive = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID\n"
     ]
    }
   ],
   "source": [
    "print(idadf.indexer)\n",
    "# sanity check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID  sepal_length  sepal_width  petal_length  petal_width species\n",
       "0   0           5.1          3.5           1.4          0.2  setosa\n",
       "1   1           4.9          3.0           1.4          0.2  setosa\n",
       "2   2           4.7          3.2           1.3          0.2  setosa\n",
       "3   3           4.6          3.1           1.5          0.2  setosa\n",
       "4   4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idadf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Build the model__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the model\n",
    "NaiveBayesModel = learn.naive_bayes.NaiveBayes(modelname=\"NAIVE_IRIS\", disc = 'ef', bins = 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_column_id', '_idadb', '_idadf', '_retrieve_NaiveBayes_Model', 'bins', 'coldefrole', 'coldeftype', 'colpropertiestable', 'describe', 'disc', 'fit', 'fit_predict', 'get_params', 'incolumn', 'labels_', 'mestimation', 'modelname', 'outtable', 'outtableProb', 'predict', 'set_params', 'target']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'modelname': 'NAIVE_IRIS',\n",
       " 'disc': 'ef',\n",
       " 'bins': 10,\n",
       " 'target': None,\n",
       " 'incolumn': None,\n",
       " 'coldeftype': None,\n",
       " 'coldefrole': None,\n",
       " 'colpropertiestable': None,\n",
       " 'outtable': None,\n",
       " 'outtableProb': None,\n",
       " 'mestimation': None}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Inspect the model\n",
    "NaiveBayesModel.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of the above parameters are optional. \n",
    "\n",
    "* modelname: If not specified, will be automatically generated. If a model with the same name already exists, it will be replaced.\n",
    "\n",
    "* disc: discretization type. Options are: Equal-frequency (ef), Minimal entropy (em), Equal-width (ew), Equal-width discretization with nice bucket limits (ewn).\n",
    "\n",
    "* bins: int, number of buckets for discretization of numeric columns\n",
    "\n",
    "* target: str, column of the input table that represents the class\n",
    "\n",
    "* incolumn: list, columns of the input table that have specific properties. If None, then all columns have default properties. Properties can be about the type (continuous or nominal) or the role(id, target, to ignore, input).\n",
    "\n",
    "More details on the [ibmdbpy documentation](https://pythonhosted.org/ibmdbpy/naive_bayes.html) and the [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.analytics.doc/doc/r_naive_bayes.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the model to your dataset\n",
    "NaiveBayesModel.fit(idadf, column_id=\"ID\", target=\"species\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = NaiveBayesModel.predict(idadf, column_id = 'ID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>CLASS</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID   CLASS\n",
       "0   0  setosa\n",
       "1   1  setosa\n",
       "2   2  setosa\n",
       "3   3  setosa\n",
       "4   4  setosa"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pred.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Analyze the results__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>species</th>\n",
       "      <th>setosa</th>\n",
       "      <th>versicolor</th>\n",
       "      <th>virginica</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A-priori probabilities</td>\n",
       "      <td>0.33333</td>\n",
       "      <td>0.33333</td>\n",
       "      <td>0.33333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  species   setosa  versicolor  virginica\n",
       "0  A-priori probabilities  0.33333     0.33333    0.33333"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NaiveBayesModel.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>true_label</th>\n",
       "      <th>setosa</th>\n",
       "      <th>versicolor</th>\n",
       "      <th>virginica</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pred_label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>setosa</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>versicolor</th>\n",
       "      <td>0</td>\n",
       "      <td>48</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>virginica</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "true_label  setosa  versicolor  virginica\n",
       "pred_label                               \n",
       "setosa          50           0          0\n",
       "versicolor       0          48          5\n",
       "virginica        0           2         45"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# confusion matrix\n",
    "iris_df = idadf.as_dataframe().set_index('ID')\n",
    "iris_df.rename(columns = {'species':'true_label'}, inplace = True)\n",
    "pred_df = pred.as_dataframe().set_index('ID')\n",
    "pred_df.rename(columns = {'CLASS':'pred_label'}, inplace = True)\n",
    "\n",
    "pd.crosstab(pred_df['pred_label'], iris_df['true_label'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MODEL\n",
      "       ATTRIBUTE VAL       CLASS  CLASSVALCOUNT  ATTRCLASSCOUNT  CLASSCOUNT  \\\n",
      "0   petal_length   1      setosa             50              50          50   \n",
      "1   petal_length   2      setosa              0              50          50   \n",
      "2   petal_length   3      setosa              0              50          50   \n",
      "3   petal_length   3  versicolor              2              50          50   \n",
      "4   petal_length   2  versicolor             48              50          50   \n",
      "5   petal_length   1  versicolor              0              50          50   \n",
      "6   petal_length   1   virginica              0              50          50   \n",
      "7   petal_length   2   virginica              6              50          50   \n",
      "8   petal_length   3   virginica             44              50          50   \n",
      "9    petal_width   1      setosa             41              50          50   \n",
      "10   petal_width   2      setosa              9              50          50   \n",
      "11   petal_width   4      setosa              0              50          50   \n",
      "12   petal_width   3      setosa              0              50          50   \n",
      "13   petal_width   1  versicolor              0              50          50   \n",
      "14   petal_width   4  versicolor              0              50          50   \n",
      "15   petal_width   2  versicolor             28              50          50   \n",
      "16   petal_width   3  versicolor             22              50          50   \n",
      "17   petal_width   2   virginica              0              50          50   \n",
      "18   petal_width   3   virginica             16              50          50   \n",
      "19   petal_width   1   virginica              0              50          50   \n",
      "20   petal_width   4   virginica             34              50          50   \n",
      "21  sepal_length   2      setosa             14              50          50   \n",
      "22  sepal_length   3      setosa              0              50          50   \n",
      "23  sepal_length   4      setosa              0              50          50   \n",
      "24  sepal_length   1      setosa             36              50          50   \n",
      "25  sepal_length   4  versicolor              9              50          50   \n",
      "26  sepal_length   1  versicolor              4              50          50   \n",
      "27  sepal_length   3  versicolor             17              50          50   \n",
      "28  sepal_length   2  versicolor             20              50          50   \n",
      "29  sepal_length   2   virginica              5              50          50   \n",
      "30  sepal_length   1   virginica              1              50          50   \n",
      "31  sepal_length   4   virginica             26              50          50   \n",
      "32  sepal_length   3   virginica             18              50          50   \n",
      "33   sepal_width   1      setosa             50              50          50   \n",
      "34   sepal_width   1  versicolor             50              50          50   \n",
      "35   sepal_width   1   virginica             50              50          50   \n",
      "\n",
      "    TOTALCOUNT  \n",
      "0          150  \n",
      "1          150  \n",
      "2          150  \n",
      "3          150  \n",
      "4          150  \n",
      "5          150  \n",
      "6          150  \n",
      "7          150  \n",
      "8          150  \n",
      "9          150  \n",
      "10         150  \n",
      "11         150  \n",
      "12         150  \n",
      "13         150  \n",
      "14         150  \n",
      "15         150  \n",
      "16         150  \n",
      "17         150  \n",
      "18         150  \n",
      "19         150  \n",
      "20         150  \n",
      "21         150  \n",
      "22         150  \n",
      "23         150  \n",
      "24         150  \n",
      "25         150  \n",
      "26         150  \n",
      "27         150  \n",
      "28         150  \n",
      "29         150  \n",
      "30         150  \n",
      "31         150  \n",
      "32         150  \n",
      "33         150  \n",
      "34         150  \n",
      "35         150  \n",
      "DISCRANGES\n",
      "        COLNAME  BREAK\n",
      "0  petal_length    1.9\n",
      "1  petal_length    4.9\n",
      "2   petal_width    0.3\n",
      "3   petal_width    1.3\n",
      "4   petal_width    1.8\n",
      "5  sepal_length    5.1\n",
      "6  sepal_length    5.8\n",
      "7  sepal_length    6.4\n",
      "8   sepal_width    NaN\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>species</th>\n",
       "      <th>setosa</th>\n",
       "      <th>versicolor</th>\n",
       "      <th>virginica</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A-priori probabilities</td>\n",
       "      <td>0.33333</td>\n",
       "      <td>0.33333</td>\n",
       "      <td>0.33333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  species   setosa  versicolor  virginica\n",
       "0  A-priori probabilities  0.33333     0.33333    0.33333"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If you want more details:\n",
    "NaiveBayesModel.describe(detail = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. `MODEL` table\n",
    "\n",
    "The first table `MODEL` can be read as follows:\n",
    "\n",
    "\"In bucket VAL of column ATTRIBUTE, there are CLASSVALCOUNT items from CLASS out of CLASSCOUNT items from this class and ATTRCLASSCOUNT items of this class having this attribute.\"\n",
    "\n",
    "Examples: \n",
    "\n",
    "* \"In bucket 1 of column petal_length, there are 50 items from class setosa, out of 50 items from this class and 50 items from class setosa having this attribute.\"\n",
    "\n",
    "* \"In bucket 2 of column petal_length, there isn't any item from class setosa, out of 50 items from this class and 50 items having this attribute.\"\n",
    "\n",
    "* \"In bucket 3 of column petal_length, there are 44 items from class virginica, out of 50 items from this class and 50 items having this attribute.\" \n",
    "\n",
    "\n",
    "2. `TOTALCOUNT` table\n",
    "\n",
    "Reminder: the iris dataset has 50 items from each of the 3 classes, so 150 in total. For each of the 36 (ATTRIBUTE, VAL, CLASS) triplet, the total number of items that fits in the buckets is given (not very insightful for a perfectly clean dataset like ours, but can be useful).\n",
    "\n",
    "3. `DISCRANGE` table\n",
    "\n",
    "The `DISCRANGE` table gives the borders of the discretization buckets for each feature. Note that the buckets are not listed in a specific order. For the `petal_length` attribute, we have 2 `BREAK` values, hence 3 buckets. Same for the other features. By default, numerical values are dicretized into 10 buckets or intervals for each class, id est each feature gets 9 breaking values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Congratulations!__ You are ready to perform naive Bayes classification with ibmdbpy!\n",
    "\n",
    "\n",
    "## Where to go from here ?\n",
    "\n",
    "More examples of ibmdbpy capabilities and ML algorithms are available on this GitHub repository.\n",
    "\n",
    "* Getting started with ibmdbpy :\n",
    "        \n",
    "    [Basics](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb)\n",
    "\n",
    "    [GeoBasics](../GettingStarted/ibmdbpy_GettingStarted_2-geo_basics.ipynb)\n",
    "\n",
    "    [Extensive Guide](../GettingStarted/ibmdbpy_GettingStarted_3-geo_guide.ipynb)\n",
    "    \n",
    "\n",
    "* More practice : analyze the Museums dataset, understand how to create IdaDataFrames and IdaGeoDataFrames:\n",
    "        \n",
    "    [Preprocessing](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_1-preprocessing.ipynb)\n",
    "\n",
    "    [Geospatial recommendation](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_2-geospatial.ipynb)\n",
    "\n",
    "\n",
    "* Machine learning with ibmdbpy: \n",
    "        \n",
    "    [Association Rules Mining](../MachineLearning/ibmdbpy_AssociationRulesMining.ipynb)\n",
    "\n",
    "    [KMeans](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_____\n",
    "__Author__\n",
    "\n",
    "Eva Feillet - ML intern, IBM Cloud and Cognitive Software @ IBM Research & Development Germany, Böblingen lab."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}