{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with ibmdbpy - Part 1: Basics\n",
"\n",
"This notebook showcases ibmdbpy's key abstractions and features. It provides you with step-by-step examples to get started with the package.\n",
"\n",
"___\n",
"\n",
"### Accelerate Python analytics with in-database processing by using ibmdbpy and IBM Db2 Warehouse\n",
" \n",
"The ibmdbpy project provides access to in-database algorithms in IBM Db2 Warehouse through a Python interface for data manipulation. It accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution, thereby benefitting from in-database performance-enhancing features, such as columnar storage and parallel processing. For more details about ibmdbpy, please refer to the [documentation](https://pythonhosted.org/ibmdbpy/index.html) and to the dedicated [Git repository](https://github.com/ibmdbanalytics/ibmdbpy/tree/master/ibmdbpy). This notebook provides you with an overview of ibmdbpy functionalities. \n",
"\n",
" \n",
"__Prerequisites__\n",
"* Db2 account: see [IBM Cloud](https://cloud.ibm.com/login) or [Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.kc.doc/welcome.html)\n",
"* Db2 driver: learn more on [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html) and see [IBM Support](https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads)\n",
"* The [ibmdbpy](https://pypi.org/project/ibmdbpy/) Python library, installed with pip: \n",
"> pip install ibmdbpy \n",
"* The [jaydebeapi](https://pypi.org/project/JayDeBeApi/) library is an optional dependency for JDBC. Run the following command to install ibmdbpy together with the dependencies for the JDBC feature:\n",
"> pip install ibmdbpy[jdbc]\n",
"\n",
"__Contents__\n",
"\n",
"1. Establish connection to Db2 database\n",
"2. Manipulate IdaDataFrames\n",
"3. Apply Machine Learning techniques with ibmdbpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Imports__"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from ibmdbpy import IdaDataBase, IdaDataFrame\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Establish connection to Db2 database\n",
"\n",
"Two options are available to connect to a remote Db2 instance: ODBC and JDBC.\n",
"ODBC (Open Database Connectivity) is a standard API for accessing a database, independently of DBMS and OS.\n",
"JDBC (Java Database Connectivity) is an API for Java which defines how a client may access a database.\n",
"The JDBC Connection is based on a Java Virtual Machine, so it is available on every machine that can run Java. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__ODBC__\n",
"\n",
"Download an IBM Db2 driver and set up an ODBC connection, including the connection protocol, the port, and the host name, before establishing the connection. You can find detailed information on the IBM Knowledge Center:\n",
"\n",
"* ODBC for [IBM Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html)\n",
"* ODBC for [IBM Db2 on Cloud](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__JDBC__\n",
"\n",
"This is how we will connect in this notebook. First, you need to download a valid driver (more info on [IBM Support](https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads)). Then you need to put the `db2jcc.jar` or `db2jcc4.jar` file in the ibmdbpy site-package folder. When ibmdbpy runs, it checks whether one of those files exists in its installation folder and uses it to establish the connection. \n",
"\n",
"More details on IBM Knowledge Center:\n",
"* JDBC for [IBM Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.doc/connecting/connect_connecting_jdbc_applications.html)\n",
"* JDBC for [IBM Db2 on Cloud](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_connecting_jdbc_applications.html)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Enter the values for your database connection\n",
"dsn_database = \"___\" # e.g. \"BLUDB\"\n",
"dsn_hostname = \"___\" # e.g.: \"abc.url.example\"\n",
"dsn_port = \"___\" # e.g. \"50000\"\n",
"dsn_uid = \"___\" # e.g. \"db2_1234\"\n",
"dsn_pwd = \"___\" # e.g. \"zorglub\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"connection_string = 'jdbc:db2://' + dsn_hostname + ':' + dsn_port + '/' + dsn_database + ':user=' + dsn_uid + ';password=' + dsn_pwd + ';'\n",
"# connect to the database, the idadb object holds the connection\n",
"idadb=IdaDataBase(dsn=connection_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You successfully connected to Db2 with ibmdbpy! When you are done, use `idadb.close()` to close the connection. To reconnect, or if the connection was broken, just use `idadb.reconnect()`. "
]
},
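{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, this is what closing and reconnecting look like (don't run this now if you want to keep working through the notebook):\n",
"> idadb.close() \n",
"\n",
"> idadb.reconnect()"
]
},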
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Verbosity and autocommit__\n",
"\n",
"The verbose mode automatically prints all SQL communication between ibmdbpy and Db2, which can be very useful for debugging or understanding how ibmdbpy works. Choose the mode with `set_verbose()` or by setting the `verbosity` option when defining the IdaDataBase object. We encourage you to look at these prints at first; afterwards, feel free to silence the verbose mode. Note that printing adds a delay when running cells. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Verbosity\n",
"from ibmdbpy.utils import set_verbose\n",
"set_verbose(False) # set to True to see the detail of the operations performed by ibmdbpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the environment variable `AUTOCOMMIT` is set to True, which means that every SQL statement submitted through the connection is executed within its own transaction and committed implicitly. If `AUTOCOMMIT` is set to False, all changes made after the last explicit commit are discarded when you close the connection to Db2. \n",
"\n",
"Let's get to it!"
]
},
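{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, this is how you would manage transactions yourself with autocommit disabled. It assumes that your ibmdbpy version exposes the `autocommit` option and the `commit` and `rollback` methods on the IdaDataBase object:\n",
"> idadb = IdaDataBase(dsn=connection_string, autocommit=False) \n",
"\n",
"> idadb.commit() # make the changes since the last commit permanent \n",
"\n",
"> idadb.rollback() # discard the changes since the last commit"
]
},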
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Manipulate IdaDataFrames\n",
"\n",
"Let's explore ibmdbpy functionalities by using sample data. The well-known iris, swiss, and titanic datasets are shipped with the package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Load sample data__\n",
"\n",
"In the following cells we first import the iris dataset as a pandas dataframe, then create an IdaDataFrame by creating a table `IRIS` into Db2 on the basis of this dataset. The option `clear_existing` is set to True, meaning that if a table called `IRIS` already exists in your database, it will be overwritten."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Import sample data\n",
"from ibmdbpy.sampledata import iris"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uploading 150 rows (maxnrow was set to 20000)\n",
"Table written\n"
]
}
],
"source": [
"# Load the data into Db2, data is handled as an IdaDataFrame\n",
"# Note: clear_existing is set to True, which means that if a table named IRIS is already present in your database,\n",
"# it will be deleted so that a new one, defined below, is created.\n",
"\n",
"idadf = idadb.as_idadataframe(iris, \"IRIS\", clear_existing = True)\n",
"\n",
"print(\"Table written\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, if the table you want to manipulate has already been written into Db2, you can create an IdaDataFrame that directly points to this table. You need to specify the IdaDataBase object when creating an IdaDataFrame because this object holds the connection to the database."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# If table already exists in Db2\n",
"idadf_bis = IdaDataFrame(idadb, 'IRIS')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Explore the data__\n",
"\n",
"Ibmdbpy has been written in such a way that it is intuitive to use when you are already familiar with Pandas and Scikit-learn. For example, you have the `head()` and `tail()` methods, like in Pandas, to take a look at the data. \n",
"\n",
"But what does ibmdbpy actually do?\n",
"Ibmdbpy translates Pandas-like syntax into SQL and uses a middleware API (e.g. JayDeBeApi) to send it to an ODBC or JDBC-connected database for execution. After fetching the results, printing them will look similar to printing a pandas.DataFrame or a pandas.Series. Here you can see that data is indeed displayed like a Pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 4.3 3.0 1.1 0.1 setosa\n",
"1 4.4 3.0 1.3 0.2 setosa\n",
"2 4.4 2.9 1.4 0.2 setosa\n",
"3 4.4 3.2 1.3 0.2 setosa\n",
"4 4.5 2.3 1.3 0.3 setosa"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Take a look at the first records\n",
"idadf.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"145 7.7 3.0 6.1 2.3 virginica\n",
"146 7.7 2.6 6.9 2.3 virginica\n",
"147 7.7 3.8 6.7 2.2 virginica\n",
"148 7.7 2.8 6.7 2.0 virginica\n",
"149 7.9 3.8 6.4 2.0 virginica"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Take a look at the last records\n",
"idadf.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: You can see in the two previous cells that the head and tail functions sort the data by the first numerical column. This can be disabled by setting the option `sort` to False when calling them. Other functions do not automatically sort the data."
]
},
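{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, to look at the first ten records without the automatic sorting (a sketch based on the `sort` option described above):\n",
"> idadf.head(10, sort=False)"
]
},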
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Simple statistics__\n",
"\n",
"In this section, you get a tour of the tools that you will use all the time for basic statistics."
]
},
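{
"cell_type": "markdown",
"metadata": {},
"source": [
"As in Pandas, you can also get a quick statistical summary of the numerical columns, assuming your ibmdbpy version provides the `describe` method:\n",
"> idadf.describe()"
]
},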
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(150, 5)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display shape of the IDA dataframe\n",
"idadf.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n"
]
}
],
"source": [
"# Print column names\n",
"print([name for name in idadf.columns])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n"
]
}
],
"source": [
"# Correlation matrix: Here you can see that the output is a Pandas Dataframe.\n",
"corr = idadf.corr()\n",
"print(type(corr))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sepal_length sepal_width petal_length petal_width\n",
"sepal_length 1.000000 -0.117570 0.871754 0.817941\n",
"sepal_width -0.117570 1.000000 -0.428440 -0.366126\n",
"petal_length 0.871754 -0.428440 1.000000 0.962865\n",
"petal_width 0.817941 -0.366126 0.962865 1.000000"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the correlation matrix:\n",
"corr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that only numerical features have been taken into account in the correlation matrix above, like in Pandas. The column containing species names does not appear above because it has a `VARCHAR` data type. Same for the covariance matrix."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" TYPENAME\n",
"sepal_length DOUBLE\n",
"sepal_width DOUBLE\n",
"petal_length DOUBLE\n",
"petal_width DOUBLE\n",
"species VARCHAR"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idadf.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sepal_length sepal_width petal_length petal_width\n",
"sepal_length 0.685694 -0.042434 1.274315 0.516271\n",
"sepal_width -0.042434 0.189979 -0.329656 -0.121639\n",
"petal_length 1.274315 -0.329656 3.116278 1.295609\n",
"petal_width 0.516271 -0.121639 1.295609 0.581006"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute the covariance matrix\n",
"idadf.cov()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Data selection and projection__\n",
"\n",
"When working with an IdaDataFrame, you can use classic slicing like with NumPy arrays and Pandas DataFrames."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note: Use of an indexer*\n",
"\n",
"This line of code would raise an error:\n",
"\n",
"> idadf_new1 = idadf[0:9] \n",
"\n",
"We want to select the first ten rows, but as the error message would say: \"Row order is not guaranteed if no indexer was given and the dataset was not sorted\". \n",
"For ibmdbpy, using row indices only makes sense if an ID column is provided; otherwise, the selection is non-deterministic. No indexer has been provided so far, which is why an error would be raised. So let's use `add_column_id` to add an indexer column to the table. \n",
"\n",
"This way we obtain a truly Pandas-like behavior and can do data selection and projection based on indices. If the `destructive` option is set to True, this column is physically added to the Db2 table and row selection becomes deterministic. Otherwise, the physical table is not modified, and each time you print rows from the IdaDataFrame, row IDs are assigned in a non-deterministic way. So we set `destructive = True` here."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The idadataframe now has an indexer. Its name is ID.\n"
]
},
{
"data": {
"text/plain": [
" ID sepal_length sepal_width petal_length petal_width species\n",
"0 0 5.1 3.5 1.4 0.2 setosa\n",
"1 1 4.9 3.0 1.4 0.2 setosa\n",
"2 2 4.7 3.2 1.3 0.2 setosa\n",
"3 3 4.6 3.1 1.5 0.2 setosa\n",
"4 4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here with an indexer\n",
"idadb.add_column_id(idadf, destructive=True)\n",
"\n",
"# Now the idadataframe has the \"ID\" column set as indexer.\n",
"print(\"\")\n",
"print(\"The idadataframe now has an indexer. Its name is %s.\" %idadf.indexer)\n",
"idadf.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table has been physically modified. If you are using a Db2 table which already has an eligible ID column, you can set this column as indexer when defining the IdaDataFrame:\n",
"> idadf = IdaDataFrame(idadb, \"DB2_TABLE_NAME\", indexer = \"ID_COLUMN_NAME\")\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are ready to perform selection and projection."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID sepal_length sepal_width petal_length petal_width species\n",
"0 10 5.4 3.7 1.5 0.2 setosa\n",
"1 11 4.8 3.4 1.6 0.2 setosa\n",
"2 12 4.8 3.0 1.4 0.1 setosa"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### Selection ###\n",
"\n",
"# No warning: we have an indexer\n",
"idadf[10:12].head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'ibmdbpy.frame.IdaDataFrame'>\n"
]
},
{
"data": {
"text/plain": [
" ID sepal_length sepal_width petal_length petal_width species\n",
"0 10 5.4 3.7 1.5 0.2 setosa\n",
"1 11 4.8 3.4 1.6 0.2 setosa\n",
"2 12 4.8 3.0 1.4 0.1 setosa"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# It is equivalent to the following using .loc\n",
"idadf_new2 = idadf.loc[10:12]\n",
"\n",
"print(type(idadf_new2))\n",
"# Note: The output is still an IdaDataFrame here (a whole table like a Pandas DataFrame), \n",
"# not an IdaSeries (single column like a Pandas Series).\n",
"\n",
"idadf_new2.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'ibmdbpy.frame.IdaDataFrame'>\n"
]
},
{
"data": {
"text/plain": [
" sepal_length sepal_width\n",
"0 5.1 3.5\n",
"1 4.9 3.0\n",
"2 4.7 3.2\n",
"3 4.6 3.1\n",
"4 5.0 3.6"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### Projection ###\n",
"\n",
"# Select columns\n",
"idadf_new3 = idadf[['sepal_length', 'sepal_width']]\n",
"print(type(idadf_new3))\n",
"idadf_new3.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Sorting__\n",
"\n",
"You can sort rows using one or more columns as the sorting criteria."
]
},
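{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch, the sort order can be reversed with a Pandas-like `ascending` option, assuming your ibmdbpy version supports it:\n",
"> idadf_new = idadf.sort(\"sepal_length\", ascending=False)"
]
},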
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID sepal_length sepal_width petal_length petal_width species\n",
"0 13 4.3 3.0 1.1 0.1 setosa\n",
"1 38 4.4 3.0 1.3 0.2 setosa\n",
"2 8 4.4 2.9 1.4 0.2 setosa\n",
"3 42 4.4 3.2 1.3 0.2 setosa\n",
"4 41 4.5 2.3 1.3 0.3 setosa"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One column\n",
"idadf_new = idadf.sort(\"sepal_length\")\n",
"idadf_new.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID sepal_length sepal_width petal_length petal_width species\n",
"0 13 4.3 3.0 1.1 0.1 setosa\n",
"1 8 4.4 2.9 1.4 0.2 setosa\n",
"2 38 4.4 3.0 1.3 0.2 setosa\n",
"3 42 4.4 3.2 1.3 0.2 setosa\n",
"4 41 4.5 2.3 1.3 0.3 setosa"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Several columns\n",
"idadf_new = idadf.sort([\"sepal_length\",\"sepal_width\"])\n",
"idadf_new.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID petal_length petal_width sepal_length sepal_width species\n",
"0 0 1.4 0.2 5.1 3.5 setosa\n",
"1 1 1.4 0.2 4.9 3.0 setosa\n",
"2 2 1.3 0.2 4.7 3.2 setosa\n",
"3 3 1.5 0.2 4.6 3.1 setosa\n",
"4 4 1.4 0.2 5.0 3.6 setosa"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Sort by columns using axis option\n",
"# 0 for rows, 1 for columns\n",
"idadf.sort(axis = 1, inplace=True)\n",
"idadf.head()\n",
"\n",
"# by default, rows are sorted by ID"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Filtering__\n",
"\n",
"You can filter rows with the usual comparison syntax."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID petal_length petal_width sepal_length sepal_width species\n",
"0 50 4.7 1.4 7.0 3.2 versicolor\n",
"1 51 4.5 1.5 6.4 3.2 versicolor\n",
"2 52 4.9 1.5 6.9 3.1 versicolor\n",
"3 53 4.0 1.3 5.5 2.3 versicolor\n",
"4 54 4.6 1.5 6.5 2.8 versicolor"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idadf_new = idadf[idadf['species'] == 'versicolor']\n",
"idadf_new.head()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID petal_length petal_width sepal_length sepal_width species\n",
"0 1 1.4 0.2 4.9 3.0 setosa\n",
"1 2 1.3 0.2 4.7 3.2 setosa\n",
"2 3 1.5 0.2 4.6 3.1 setosa\n",
"3 6 1.4 0.3 4.6 3.4 setosa\n",
"4 8 1.4 0.2 4.4 2.9 setosa"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idadf_new = idadf[idadf['sepal_length'] < 5]\n",
"idadf_new.head()"
]
},
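{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also combine several filter conditions with logical operators, as in Pandas (a sketch; assumes your ibmdbpy version supports combined conditions):\n",
"> idadf_new = idadf[(idadf['sepal_length'] < 5) & (idadf['species'] == 'setosa')]"
]
},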
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Feature engineering__\n",
"\n",
"Again, very Pandas-like! Enjoy the simplicity of a single line of code that corresponds to a complex SQL query; ibmdbpy does the translation!"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID petal_length petal_width sepal_length sepal_width species new_col\n",
"0 0 1.4 0.2 5.1 3.5 setosa 17.85\n",
"1 1 1.4 0.2 4.9 3.0 setosa 14.70\n",
"2 2 1.3 0.2 4.7 3.2 setosa 15.04\n",
"3 3 1.5 0.2 4.6 3.1 setosa 14.26\n",
"4 4 1.4 0.2 5.0 3.6 setosa 18.00"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a new column by multiplying two existing columns\n",
"idadf['new_col'] = idadf['sepal_length'] * idadf['sepal_width']\n",
"idadf.head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID petal_length petal_width sepal_length sepal_width species new_col\n",
"0 0 1.4 0.2 2.55 3.5 setosa 17.85\n",
"1 1 1.4 0.2 2.45 3.0 setosa 14.70\n",
"2 2 1.3 0.2 2.35 3.2 setosa 15.04\n",
"3 3 1.5 0.2 2.30 3.1 setosa 14.26\n",
"4 4 1.4 0.2 2.50 3.6 setosa 18.00"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Replace an existing column with newly computed values\n",
"idadf['sepal_length'] /= 2\n",
"idadf.head()"
]
},
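The column arithmetic above mirrors plain pandas semantics. As a minimal sketch (using only pandas and the first two iris rows from the output above, not a live Db2 connection), the same two operations look like this:

```python
import pandas as pd

# First two iris rows, taken from the output shown above
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# Create a new column by multiplying two existing columns
df["new_col"] = df["sepal_length"] * df["sepal_width"]

# Replace an existing column with newly computed values
df["sepal_length"] /= 2

print(df.round(2))  # new_col: 17.85, 14.70; sepal_length: 2.55, 2.45
```

The difference is that ibmdbpy translates these expressions into SQL and evaluates them lazily against the database, rather than in local memory.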
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Saving your work__\n",
"\n",
"Save your work in a new Db2 table. If you look at the IRIS table in the Db2 Warehouse console, you will see that an ID column has indeed been added... but the sorting, projections, and other operations we performed above have not been written to the database! These changes only exist on the IdaDataFrame and will be lost once the connection is closed. So if you want to save your work, you need to either overwrite the original Db2 table or create a new table under a new name.\n",
"\n",
"To overwrite the `IRIS` table:\n",
"> idadf.save_as(tablename = \"IRIS\", clear_existing = True)\n",
"\n",
"To create a new table:\n",
"> idadf.save_as(tablename = \"NEW_NAME\", clear_existing = True)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/eva.feillet/anaconda3/lib/python3.7/site-packages/ibmdbpy/frame.py:1899: UserWarning: Table TEST already exists.\n",
" warnings.warn(message, UserWarning)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Are you sure that you want to overwrite TEST [y/n] y\n"
]
}
],
"source": [
"idadf.save_as(tablename = \"TEST\", clear_existing = True)\n",
"# note: we have set clear_existing to True in case you run the notebook several times"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Wrap-up__\n",
"\n",
"Nice job! You are now familiar with operations on IdaDataFrames in ibmdbpy. Let's move on to the next section, where you will see how to use ibmdbpy's machine learning functionalities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply Machine Learning techniques with ibmdbpy\n",
"\n",
"Ibmdbpy provides you not only with in-database pandas-like functionalities, but also with in-database scikit-learn-like ones. Algorithms for KMeans clustering, association-rule mining, and naive Bayes classification are provided in `ibmdbpy.learn`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Example: KMeans algorithm__"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from ibmdbpy.learn import KMeans\n",
"kmeans = KMeans(3) # configure clustering with 3 clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: in-database KMeans needs an indexer to identify each row. As checked above, this is the case for our IdaDataFrame, so we can go ahead with model fitting and prediction!"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uploading 150 rows (maxnrow was set to 20000)\n",
"ID\n"
]
}
],
"source": [
"# Start afresh with the original iris data\n",
"from ibmdbpy.sampledata import iris\n",
"idadf = idadb.as_idadataframe(iris, \"IRIS\", clear_existing = True)\n",
"\n",
"# Reminder: an indexer must be set before calling .fit\n",
"idadb.add_column_id(idadf, destructive=True)\n",
"\n",
"# here we check that we have an indexer\n",
"print(idadf.indexer)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" CLUSTER_ID | \n",
" DISTANCE | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 2 | \n",
" 0.141351 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 2 | \n",
" 0.447638 | \n",
"
\n",
" \n",
" | 2 | \n",
" 2 | \n",
" 2 | \n",
" 0.417109 | \n",
"
\n",
" \n",
" | 3 | \n",
" 3 | \n",
" 2 | \n",
" 0.525338 | \n",
"
\n",
" \n",
" | 4 | \n",
" 4 | \n",
" 2 | \n",
" 0.188627 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID CLUSTER_ID DISTANCE\n",
"0 0 2 0.141351\n",
"1 1 2 0.447638\n",
"2 2 2 0.417109\n",
"3 3 2 0.525338\n",
"4 4 2 0.188627"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Model fitting and prediction \n",
"kmeans.fit(idadf[['ID', 'sepal_length','sepal_width','petal_length','petal_width']])\n",
"pred = kmeans.predict(idadf)\n",
"pred.head()\n",
"# DISTANCE: distance to the respective cluster center"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By calling the `describe` method, you access the characteristics of the model that you have just built with `fit`. Here you see the coordinates (values) of the cluster centers in the K-Means model. For example, the first cluster center has a petal length of 5.384127 units."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"KMeans clustering with 3 clusters of sizes 37, 63, 50\n",
"\n",
"Cluster means: \n",
" CLUSTERID sepal_length sepal_width petal_length petal_width\n",
"0 1 6.601587 2.985714 5.384127 1.915873\n",
"1 2 5.006000 3.428000 1.462000 0.246000\n",
"2 3 5.683784 2.678378 4.091892 1.267568\n",
"\n",
"Within cluster sum of squares by cluster:\n",
"[16.57416896 53.22701047 15.151 ]\n"
]
}
],
"source": [
"kmeans.describe()"
]
},
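The "within cluster sum of squares" statistic reported by `describe` is the sum of squared distances from each point to its cluster center. A tiny hand-made illustration (hypothetical 2-D points, not the iris data) shows the computation:

```python
import numpy as np

# Three hypothetical points forming a single cluster
points = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

# The cluster center is the mean of its points
centre = points.mean(axis=0)  # [2.0, 2.0]

# Within-cluster sum of squares: squared deviations from the center
wcss = ((points - centre) ** 2).sum()
print(wcss)  # 4.0
```

The three values printed by `describe` above are this quantity computed per cluster over the four iris feature columns.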
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a closer look at the predictions:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
" CLUSTER_ID | \n",
" DISTANCE | \n",
"
\n",
" \n",
" | ID | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
" 2 | \n",
" 0.141351 | \n",
"
\n",
" \n",
" | 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
" 2 | \n",
" 0.447638 | \n",
"
\n",
" \n",
" | 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
" 2 | \n",
" 0.417109 | \n",
"
\n",
" \n",
" | 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
" 2 | \n",
" 0.525338 | \n",
"
\n",
" \n",
" | 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
" 2 | \n",
" 0.188627 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species CLUSTER_ID \\\n",
"ID \n",
"0 5.1 3.5 1.4 0.2 setosa 2 \n",
"1 4.9 3.0 1.4 0.2 setosa 2 \n",
"2 4.7 3.2 1.3 0.2 setosa 2 \n",
"3 4.6 3.1 1.5 0.2 setosa 2 \n",
"4 5.0 3.6 1.4 0.2 setosa 2 \n",
"\n",
" DISTANCE \n",
"ID \n",
"0 0.141351 \n",
"1 0.447638 \n",
"2 0.417109 \n",
"3 0.525338 \n",
"4 0.188627 "
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df = idadf.as_dataframe().set_index('ID')\n",
"pred_df = pred.as_dataframe().set_index('ID')\n",
"merged_df = iris_df.merge(pred_df, left_index=True, right_index=True)\n",
"merged_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course there are a few misclassified examples, but the original classes have been broadly rediscovered. Label 1 corresponds to the virginica species, label 2 to setosa, and label 3 to versicolor."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | species | \n",
" setosa | \n",
" versicolor | \n",
" virginica | \n",
"
\n",
" \n",
" | CLUSTER_ID | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 0 | \n",
" 12 | \n",
" 47 | \n",
"
\n",
" \n",
" | 2 | \n",
" 50 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" | 3 | \n",
" 0 | \n",
" 38 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"species setosa versicolor virginica\n",
"CLUSTER_ID \n",
"1 0 12 47\n",
"2 50 0 0\n",
"3 0 38 3"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# overview\n",
"pd.crosstab(merged_df['CLUSTER_ID'], merged_df['species'])"
]
},
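The label-to-species correspondence can also be read off the crosstab programmatically by picking, for each cluster, the species with the highest count. A small self-contained sketch (rebuilding the counts shown above rather than reusing `merged_df`):

```python
import pandas as pd

# Counts from the crosstab above: rows are CLUSTER_ID, columns are species
counts = pd.DataFrame(
    {"setosa": [0, 50, 0], "versicolor": [12, 0, 38], "virginica": [47, 0, 3]},
    index=pd.Index([1, 2, 3], name="CLUSTER_ID"),
)

# For each cluster, pick the dominant species
label_to_species = counts.idxmax(axis=1).to_dict()
print(label_to_species)  # {1: 'virginica', 2: 'setosa', 3: 'versicolor'}
```

In a live session you could build `counts` directly with `pd.crosstab(merged_df['CLUSTER_ID'], merged_df['species'])`.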
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Other ML algorithms__\n",
"\n",
"Ibmdbpy also provides you with tools for association-rule mining and naive Bayes classification.\n",
"\n",
"For more details, see the separate notebooks about machine learning with ibmdbpy on our GitHub repository!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"__Final step: Close the connection__\n",
"\n",
"Closing the IdaDataBase closes the underlying connection: once the connection is closed, neither the IdaDataBase instance nor any IdaDataFrame instances opened on it can be used anymore."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connection closed.\n"
]
}
],
"source": [
"idadb.close()\n",
"# to reconnect: idadb.reconnect()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where to go from here?\n",
"\n",
"Well done! You are now familiar with the basic functionalities of ibmdbpy! You are ready to explore the other notebooks of this series:\n",
"\n",
"* Getting started with ibmdbpy geospatial functions:\n",
" \n",
" [GeoBasics](./ibmdbpy_GettingStarted_2-geo_basics.ipynb)\n",
"\n",
" [Extensive guide](./ibmdbpy_GettingStarted_3-geo_guide.ipynb)\n",
" \n",
"\n",
"* Ibmdbpy in practice: analyze the Museums dataset and learn how to create IdaDataFrames and IdaGeoDataFrames:\n",
" \n",
" [Preprocessing](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_1-preprocessing.ipynb)\n",
"\n",
" [Geospatial recommendation](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_2-geospatial.ipynb)\n",
"\n",
"\n",
"* Machine learning with ibmdbpy: \n",
" \n",
" [Naïve Bayes](../MachineLearning/ibmdbpy_NaiveBayes.ipynb)\n",
"\n",
" [Association Rules Mining](../MachineLearning/ibmdbpy_AssociationRulesMining.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"____\n",
"\n",
"__Authors__\n",
"\n",
"Eva Feillet - ML intern, IBM Cloud and Cognitive Software @ IBM Lab in Böblingen, Germany"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}