{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning with ibmdbpy and Db2 Warehouse: Naïve Bayes Classification\n", "\n", "This notebook explains how to perform Naive Bayes classification with ibmdbpy Python package and data stored in Db2 Warehouse.\n", "___\n", "\n", "### Accelerate Python analytics with in-database processing by using ibmdbpy and IBM Db2 Warehouse\n", " \n", "The ibmdbpy project provides a Python interface for data manipulation and access to in-database algorithms in IBM Db2 Warehouse. It accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution, thereby benefitting from in-database performance-enhancing features, such as columnar storage and parallel processing. For more details about ibmdbpy, please refer to the [documentation](https://pythonhosted.org/ibmdbpy/index.html) and to the dedicated [Git repository](https://github.com/ibmdbanalytics/ibmdbpy/tree/master/ibmdbpy). This notebook provides you with an overview of ibmdbpy functionalities. \n", "\n", " \n", "__Prerequisites__\n", "* Db2 account: see [IBM Cloud](https://cloud.ibm.com/login) or [Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.kc.doc/welcome.html)\n", "* Db2 driver: learn more on [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html) and see [IBM Support](https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads)\n", "* Having installed the [ibmdbpy](https://pypi.org/project/ibmdbpy/) python library with pip: \n", "> pip install ibmdbpy \n", "* Optional dependency for JDBC is the [jaydebeapi](https://pypi.org/project/JayDeBeApi/) library. Run the following command to install ibmdbpy, as well as the dependencies for the JDBC feature:\n", "> pip install ibmdbpy[jdbc]\n", "\n", "__About naive Bayes classification__\n", "\n", "The Naive Bayes algorithm is a type of probabilistic ML classification algorithm. It is based on the Bayes theorem and on the assumption that predictors are independent. In practice, this is not really the case but naive Bayes classifiers perform quite well on tasks such as sentiment analysis, spam filtering and product recommendation.\n", "More on the [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.analytics.doc/doc/r_naive_bayes.html).\n", "\n", "__Contents__\n", "\n", "This notebook shows you how to perform naive Bayes classification with ibmdbpy on the iris data sample.\n", "\n", "__1. Get started__\n", "* Imports\n", "* Establish a JDBC connection to Db2\n", "* Load sample data\n", "\n", "__2. Perform Naive Bayes Classification with ibmdbpy__\n", "* Define an indexer\n", "* Build the model\n", "* Analyze the results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Get started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Imports__\n", "\n", "As usual we import some useful modules from ibmdbpy." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from ibmdbpy import IdaDataBase, IdaDataFrame\n", "from ibmdbpy import learn\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Establish JDBC connection to Db2 Warehouse__\n", "\n", "Please enter your credentials in order to connect to the database." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Enter the values for you database connection\n", "dsn_database = \"___\" # e.g. \"BLUDB\"\n", "dsn_hostname = \"___\" # e.g.: \"abc.url.example\"\n", "dsn_port = \"___\" # e.g. \"50000\"\n", "dsn_uid = \"___\" # e.g. \"db2_1234\"\n", "dsn_pwd = \"___\" # e.g. \"zorglub\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Establish connection\n", "connection_string='jdbc:db2://'+dsn_hostname+':'+dsn_port+'/'+dsn_database+':user='+dsn_uid+';password='+dsn_pwd+\";\" \n", "idadb=IdaDataBase(dsn=connection_string, verbose = False)\n", "# set verbose to True if you want to see the detail of ibmdbpy operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the environment variable `AUTOCOMMIT` is then set to True, which means that every SQL statement which is submitted through the connection is executed within its own transaction and then committed implicitly. When you close the connection to Db2, if the environment variable `AUTOCOMMIT` is set to False, then all changes after the last explicit commit are discarded. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Load sample data__\n", "\n", "In this notebook, we simply use the well-known iris dataset, which is available out of the box in ibmdbpy. The dataset is first loaded as a Pandas DataFrame and then written into Db2 as a new table called IRIS. We define an IdaDataFrame with name idadf as a reference to this physical table. Finally, we add an ID column to index the rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading 150 rows (maxnrow was set to 20000)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
04.33.01.10.1setosa
14.43.01.30.2setosa
24.42.91.40.2setosa
34.43.21.30.2setosa
44.52.31.30.3setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 4.3 3.0 1.1 0.1 setosa\n", "1 4.4 3.0 1.3 0.2 setosa\n", "2 4.4 2.9 1.4 0.2 setosa\n", "3 4.4 3.2 1.3 0.2 setosa\n", "4 4.5 2.3 1.3 0.3 setosa" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fetch sample data\n", "from ibmdbpy.sampledata import iris\n", "\n", "# Convert the pandas DataFrame into an IdaDataFrame\n", "idadf = idadb.as_idadataframe(iris, \"IRIS\", clear_existing=True)\n", "\n", "idadf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Perform Naive Bayes Classification with ibmdbpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Define an indexer__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need an indexer to be able to fit the model on the dataset! We have defined the indexer with `add_column_id` in the previous cell. Alternatively, if an eligible column already exists, you can set it as indexer as follows:\n", "\n", "> idadf = idadb.as_idadataframe(df, indexer = \"col_name\")\n", "\n", "or, if the table already exists, define the IdaDataFrame by directly pointing at this table:\n", "\n", "> idadf = IdaDataFrame(tablename, indexer = \"col_name\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Add an indexer column\n", "# destructive = True ensures that the underlying table is modified too\n", "# Otherwise, an ID column is only added to the view.\n", "# Of course, physically modifying the table takes more time.\n", "idadb.add_column_id(idadf, destructive = True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID\n" ] } ], "source": [ "print(idadf.indexer)\n", "# sanity check" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDsepal_lengthsepal_widthpetal_lengthpetal_widthspecies
005.13.51.40.2setosa
114.93.01.40.2setosa
224.73.21.30.2setosa
334.63.11.50.2setosa
445.03.61.40.2setosa
\n", "
" ], "text/plain": [ " ID sepal_length sepal_width petal_length petal_width species\n", "0 0 5.1 3.5 1.4 0.2 setosa\n", "1 1 4.9 3.0 1.4 0.2 setosa\n", "2 2 4.7 3.2 1.3 0.2 setosa\n", "3 3 4.6 3.1 1.5 0.2 setosa\n", "4 4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idadf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Build the model__" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Define the model\n", "NaiveBayesModel = learn.naive_bayes.NaiveBayes(modelname=\"NAIVE_IRIS\", disc = 'ef', bins = 10)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_column_id', '_idadb', '_idadf', '_retrieve_NaiveBayes_Model', 'bins', 'coldefrole', 'coldeftype', 'colpropertiestable', 'describe', 'disc', 'fit', 'fit_predict', 'get_params', 'incolumn', 'labels_', 'mestimation', 'modelname', 'outtable', 'outtableProb', 'predict', 'set_params', 'target']\n" ] }, { "data": { "text/plain": [ "{'modelname': 'NAIVE_IRIS',\n", " 'disc': 'ef',\n", " 'bins': 10,\n", " 'target': None,\n", " 'incolumn': None,\n", " 'coldeftype': None,\n", " 'coldefrole': None,\n", " 'colpropertiestable': None,\n", " 'outtable': None,\n", " 'outtableProb': None,\n", " 'mestimation': None}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect the model\n", "NaiveBayesModel.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the above parameters are optional. \n", "\n", "* modelname: If not specified, will be automatically generated. If a model with the same name already exists, it will be replaced.\n", "\n", "* disc: discretization type. Options are: Equal-frequency (ef), Minimal entropy (em), Equal-width (ew), Equal-width discretization with nice bucket limits (ewn).\n", "\n", "* bins: int, number of buckets for discretization of numeric columns\n", "\n", "* target: str, column of the input table that represents the class\n", "\n", "* incolumn: list, columns of the input table that have specific properties. If None, then all columns have default properties. Properties can be about the type (continuous or nominal) or the role(id, target, to ignore, input).\n", "\n", "More details on the [ibmdbpy documentation](https://pythonhosted.org/ibmdbpy/naive_bayes.html) and the [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.analytics.doc/doc/r_naive_bayes.html)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Fit the model to your dataset\n", "NaiveBayesModel.fit(idadf, column_id=\"ID\", target=\"species\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "pred = NaiveBayesModel.predict(idadf, column_id = 'ID')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDCLASS
00setosa
11setosa
22setosa
33setosa
44setosa
\n", "
" ], "text/plain": [ " ID CLASS\n", "0 0 setosa\n", "1 1 setosa\n", "2 2 setosa\n", "3 3 setosa\n", "4 4 setosa" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Analyze the results__" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessetosaversicolorvirginica
0A-priori probabilities0.333330.333330.33333
\n", "
" ], "text/plain": [ " species setosa versicolor virginica\n", "0 A-priori probabilities 0.33333 0.33333 0.33333" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "NaiveBayesModel.describe()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
true_labelsetosaversicolorvirginica
pred_label
setosa5000
versicolor0485
virginica0245
\n", "
" ], "text/plain": [ "true_label setosa versicolor virginica\n", "pred_label \n", "setosa 50 0 0\n", "versicolor 0 48 5\n", "virginica 0 2 45" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confusion matrix\n", "iris_df = idadf.as_dataframe().set_index('ID')\n", "iris_df.rename(columns = {'species':'true_label'}, inplace = True)\n", "pred_df = pred.as_dataframe().set_index('ID')\n", "pred_df.rename(columns = {'CLASS':'pred_label'}, inplace = True)\n", "\n", "pd.crosstab(pred_df['pred_label'], iris_df['true_label'])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MODEL\n", " ATTRIBUTE VAL CLASS CLASSVALCOUNT ATTRCLASSCOUNT CLASSCOUNT \\\n", "0 petal_length 1 setosa 50 50 50 \n", "1 petal_length 2 setosa 0 50 50 \n", "2 petal_length 3 setosa 0 50 50 \n", "3 petal_length 3 versicolor 2 50 50 \n", "4 petal_length 2 versicolor 48 50 50 \n", "5 petal_length 1 versicolor 0 50 50 \n", "6 petal_length 1 virginica 0 50 50 \n", "7 petal_length 2 virginica 6 50 50 \n", "8 petal_length 3 virginica 44 50 50 \n", "9 petal_width 1 setosa 41 50 50 \n", "10 petal_width 2 setosa 9 50 50 \n", "11 petal_width 4 setosa 0 50 50 \n", "12 petal_width 3 setosa 0 50 50 \n", "13 petal_width 1 versicolor 0 50 50 \n", "14 petal_width 4 versicolor 0 50 50 \n", "15 petal_width 2 versicolor 28 50 50 \n", "16 petal_width 3 versicolor 22 50 50 \n", "17 petal_width 2 virginica 0 50 50 \n", "18 petal_width 3 virginica 16 50 50 \n", "19 petal_width 1 virginica 0 50 50 \n", "20 petal_width 4 virginica 34 50 50 \n", "21 sepal_length 2 setosa 14 50 50 \n", "22 sepal_length 3 setosa 0 50 50 \n", "23 sepal_length 4 setosa 0 50 50 \n", "24 sepal_length 1 setosa 36 50 50 \n", "25 sepal_length 4 versicolor 9 50 50 \n", "26 sepal_length 1 versicolor 4 50 50 \n", "27 sepal_length 3 versicolor 17 50 50 \n", "28 sepal_length 2 versicolor 20 50 50 \n", "29 sepal_length 2 virginica 5 50 50 \n", "30 sepal_length 1 virginica 1 50 50 \n", "31 sepal_length 4 virginica 26 50 50 \n", "32 sepal_length 3 virginica 18 50 50 \n", "33 sepal_width 1 setosa 50 50 50 \n", "34 sepal_width 1 versicolor 50 50 50 \n", "35 sepal_width 1 virginica 50 50 50 \n", "\n", " TOTALCOUNT \n", "0 150 \n", "1 150 \n", "2 150 \n", "3 150 \n", "4 150 \n", "5 150 \n", "6 150 \n", "7 150 \n", "8 150 \n", "9 150 \n", "10 150 \n", "11 150 \n", "12 150 \n", "13 150 \n", "14 150 \n", "15 150 \n", "16 150 \n", "17 150 \n", "18 150 \n", "19 150 \n", "20 150 \n", "21 150 \n", "22 150 \n", "23 150 \n", "24 150 \n", "25 150 \n", "26 150 \n", "27 150 \n", "28 150 \n", "29 150 \n", "30 150 \n", "31 150 \n", "32 150 \n", "33 150 \n", "34 150 \n", "35 150 \n", "DISCRANGES\n", " COLNAME BREAK\n", "0 petal_length 1.9\n", "1 petal_length 4.9\n", "2 petal_width 0.3\n", "3 petal_width 1.3\n", "4 petal_width 1.8\n", "5 sepal_length 5.1\n", "6 sepal_length 5.8\n", "7 sepal_length 6.4\n", "8 sepal_width NaN\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessetosaversicolorvirginica
0A-priori probabilities0.333330.333330.33333
\n", "
" ], "text/plain": [ " species setosa versicolor virginica\n", "0 A-priori probabilities 0.33333 0.33333 0.33333" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If you want more details:\n", "NaiveBayesModel.describe(detail = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. `MODEL` table\n", "\n", "The first table `MODEL` can be read as follows:\n", "\n", "\"In bucket VAL of column ATTRIBUTE, there are CLASSVALCOUNT items from CLASS out of CLASSCOUNT items from this class and ATTRCLASSCOUNT items of this class having this attribute.\"\n", "\n", "Examples: \n", "\n", "* \"In bucket 1 of column petal_length, there are 50 items from class setosa, out of 50 items from this class and 50 items from class setosa having this attribute.\"\n", "\n", "* \"In bucket 2 of column petal_length, there isn't any item from class setosa, out of 50 items from this class and 50 items having this attribute.\"\n", "\n", "* \"In bucket 3 of column petal_length, there are 44 items from class virginica, out of 50 items from this class and 50 items having this attribute.\" \n", "\n", "\n", "2. `TOTALCOUNT` table\n", "\n", "Reminder: the iris dataset has 50 items from each of the 3 classes, so 150 in total. For each of the 36 (ATTRIBUTE, VAL, CLASS) triplet, the total number of items that fits in the buckets is given (not very insightful for a perfectly clean dataset like ours, but can be useful).\n", "\n", "3. `DISCRANGE` table\n", "\n", "The `DISCRANGE` table gives the borders of the discretization buckets for each feature. Note that the buckets are not listed in a specific order. For the `petal_length` attribute, we have 2 `BREAK` values, hence 3 buckets. Same for the other features. By default, numerical values are dicretized into 10 buckets or intervals for each class, id est each feature gets 9 breaking values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Congratulations!__ You are ready to perform naive Bayes classification with ibmdbpy!\n", "\n", "\n", "## Where to go from here ?\n", "\n", "More examples of ibmdbpy capabilities and ML algorithms are available on this GitHub repository.\n", "\n", "* Getting started with ibmdbpy :\n", " \n", " [Basics](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb)\n", "\n", " [GeoBasics](../GettingStarted/ibmdbpy_GettingStarted_2-geo_basics.ipynb)\n", "\n", " [Extensive Guide](../GettingStarted/ibmdbpy_GettingStarted_3-geo_guide.ipynb)\n", " \n", "\n", "* More practice : analyze the Museums dataset, understand how to create IdaDataFrames and IdaGeoDataFrames:\n", " \n", " [Preprocessing](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_1-preprocessing.ipynb)\n", "\n", " [Geospatial recommendation](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_2-geospatial.ipynb)\n", "\n", "\n", "* Machine learning with ibmdbpy: \n", " \n", " [Association Rules Mining](../MachineLearning/ibmdbpy_AssociationRulesMining.ipynb)\n", "\n", " [KMeans](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_____\n", "__Author__\n", "\n", "Eva Feillet - ML intern, IBM Cloud and Cognitive Software @ IBM Research & Development Germany, Böblingen lab." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }