{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning with ibmdbpy and Db2 Warehouse: Association Rules Mining \n", "\n", "This notebook explains how to perform association rules mining with the ibmdbpy Python package and data stored in Db2 Warehouse.\n", "___\n", "### Accelerate Python analytics with in-database processing by using ibmdbpy and IBM Db2 Warehouse\n", " \n", "The ibmdbpy project provides a Python interface for data manipulation and access to in-database algorithms in IBM Db2 Warehouse. It accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution, thereby benefitting from in-database performance-enhancing features, such as columnar storage and parallel processing. For more details about ibmdbpy, please refer to the [documentation](https://pythonhosted.org/ibmdbpy/index.html) and to the dedicated [Git repository](https://github.com/ibmdbanalytics/ibmdbpy/tree/master/ibmdbpy). This notebook provides you with an overview of ibmdbpy functionalities. \n", "\n", " \n", "__Prerequisites__\n", "* Db2 account: see [IBM Cloud](https://cloud.ibm.com/login) or [Db2 Warehouse](https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.kc.doc/welcome.html)\n", "* Db2 driver: learn more on [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/en/SSFMBX/com.ibm.swg.im.dashdb.doc/connecting/connect_applications_by_type.html) and see [IBM Support](https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads)\n", "* Having installed the [ibmdbpy](https://pypi.org/project/ibmdbpy/) python library with pip: \n", "> pip install ibmdbpy \n", "* Optional dependency for JDBC is the [jaydebeapi](https://pypi.org/project/JayDeBeApi/) library. Run the following command to install ibmdbpy, as well as the dependencies for the JDBC feature:\n", "> pip install ibmdbpy[jdbc]\n", "* Sample data: follow this [steps](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.doc/admin/sampledata.html ) to download sample data for Db2. \n", "\n", "__Association rules mining__\n", "\n", "Association rules mining is a common technique to better know your customers' habits. It answers the question: what items are commonly bought together? As it enables you to more efficiently analyze the shopping baskets of your customers, it also gives you hints about which promotions or actions to propose. For example, if flour, sugar and couverture chocolate are frequently bought together, discounting couverture chocolate might boost the sales of flour and sugar, because they are the basic ingredients you need for a tasty cake with a nice chocolate topping. You can find more details about associations rules mining and its support in Db2 Warehouse in the [IBM Knowledge Center](https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.analytics.doc/doc/r_association_rules.html). \n", "\n", "__Contents__\n", "\n", "This notebook uses the transactions dataset of an outdoor store. Let's see what patterns we find!\n", "\n", "__1. Get started__\n", "* Import what you need\n", "* Establish a JDBC connection to Db2 Warehouse\n", "* Explore the data\n", " \n", "__2. Mine Association Rules with ibmdbpy__\n", "* Reminder about Association Rules Mining\n", " * General task\n", " * Definitions\n", "* Put your knowledge into practice\n", " * Define an AssociationRules object\n", " * Analyze the results\n", " * Experiment for yourself with pruning!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Get started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Import what you need__\n", "\n", "As usual, we first import some basic modules from ibmdbpy." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from ibmdbpy import IdaDataBase, IdaDataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Establish a JDBC connection to Db2 Warehouse__\n", "\n", "Please enter your credentials in order to connect to the database." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Enter the values for you database connection\n", "dsn_database = \"___\" # e.g. \"BLUDB\"\n", "dsn_hostname = \"___\" # e.g.: \"abc.url.example\"\n", "dsn_port = \"___\" # e.g. \"50000\"\n", "dsn_uid = \"___\" # e.g. \"db2_1234\"\n", "dsn_pwd = \"___\" # e.g. \"zorglub\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Establish connection\n", "connection_string='jdbc:db2://'+dsn_hostname+':'+dsn_port+'/'+dsn_database+':user='+dsn_uid+';password='+dsn_pwd+\";\" \n", "idadb=IdaDataBase(dsn=connection_string, verbose = False)\n", "# Set verbose to True if you want to see the detail of ibmdbpy operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the environment variable `AUTOCOMMIT` is then set to True, which means that every SQL statement which is submitted through the connection is executed within its own transaction and then committed implicitly. When you close the connection to Db2, if the environment variable `AUTOCOMMIT` is set to False, then all changes after the last explicit commit are discarded. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Explore the data__\n", "\n", "We use Db2 Warehouse sample data that can be downloaded [here](https://www.ibm.com/support/knowledgecenter/SSCJDQ/com.ibm.swg.im.dashdb.doc/admin/sampledata.html) . \n", "\n", "* GOSALES.ORDER_DETAILS contains data about transactions: product, price, quantity, shipment etc. \n", "* GOSALES.PRODUCT presents information about each product's characteristics: product type, color, brand etc. \n", "* GOSALES.PRODUCT_TYPE contains the name of each product type in various languages such as arabic, german, english, corean etc. \n", "* GOSALES.PRODUCT_NAME_LOOKUP contains, for each product, its name in all contries' language where it is sold. \n", "\n", "Let's define IdaDataFrames to explore our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* __Explore the original tables__" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ORDER_DETAIL_CODEORDER_NUMBERSHIP_DATEPRODUCT_NUMBERPROMOTION_CODEQUANTITYUNIT_COSTUNIT_PRICEUNIT_SALE_PRICE
010000011000012010-01-19 00:00:0041110025615.6235.0933.69
110000021000012010-02-17 00:00:006911009249.69110.00102.30
210000131000022010-01-19 00:00:0075110016280.00119.69111.31
310000141000022010-01-19 00:00:0076110017223.5340.5238.90
410000151000022010-01-19 00:00:0085110074176.47359.60334.43
\n", "
" ], "text/plain": [ " ORDER_DETAIL_CODE ORDER_NUMBER SHIP_DATE PRODUCT_NUMBER \\\n", "0 1000001 100001 2010-01-19 00:00:00 41110 \n", "1 1000002 100001 2010-02-17 00:00:00 69110 \n", "2 1000013 100002 2010-01-19 00:00:00 75110 \n", "3 1000014 100002 2010-01-19 00:00:00 76110 \n", "4 1000015 100002 2010-01-19 00:00:00 85110 \n", "\n", " PROMOTION_CODE QUANTITY UNIT_COST UNIT_PRICE UNIT_SALE_PRICE \n", "0 0 256 15.62 35.09 33.69 \n", "1 0 92 49.69 110.00 102.30 \n", "2 0 162 80.00 119.69 111.31 \n", "3 0 172 23.53 40.52 38.90 \n", "4 0 74 176.47 359.60 334.43 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders = IdaDataFrame(idadb, \"GOSALES.ORDER_DETAILS\")\n", "orders.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distinct transactions: 53256\n" ] } ], "source": [ "o1 = orders['ORDER_NUMBER'].count_distinct()\n", "print(\"Distinct transactions: %s\" %o1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will apply the association rules mining algorithm to the `orders` IdaDataFrame : it contains both the transaction IDs (column `ORDER_NUMBER`) and the product IDs (column `PRODUCT_NUMBER`). The other tables are useful to get some additional or contextual information." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PRODUCT_NUMBERBASE_PRODUCT_NUMBERINTRODUCTION_DATEDISCONTINUED_DATEPRODUCT_TYPE_CODEPRODUCT_COLOR_CODEPRODUCT_SIZE_CODEPRODUCT_BRAND_CODEPRODUCT_IMAGE
0111012001-02-15 00:00:00None951908808701P01CE1CG1.jpg
1211022001-02-15 00:00:00None951906807701P02CE1CG1.jpg
2311032001-02-15 00:00:00None951924825701P03CE1CG1.jpg
3411042001-02-15 00:00:00None951923804701P04CE1CG1.jpg
4511052001-02-15 00:00:00None951923823701P05CE1CG1.jpg
\n", "
" ], "text/plain": [ " PRODUCT_NUMBER BASE_PRODUCT_NUMBER INTRODUCTION_DATE DISCONTINUED_DATE \\\n", "0 1110 1 2001-02-15 00:00:00 None \n", "1 2110 2 2001-02-15 00:00:00 None \n", "2 3110 3 2001-02-15 00:00:00 None \n", "3 4110 4 2001-02-15 00:00:00 None \n", "4 5110 5 2001-02-15 00:00:00 None \n", "\n", " PRODUCT_TYPE_CODE PRODUCT_COLOR_CODE PRODUCT_SIZE_CODE \\\n", "0 951 908 808 \n", "1 951 906 807 \n", "2 951 924 825 \n", "3 951 923 804 \n", "4 951 923 823 \n", "\n", " PRODUCT_BRAND_CODE PRODUCT_IMAGE \n", "0 701 P01CE1CG1.jpg \n", "1 701 P02CE1CG1.jpg \n", "2 701 P03CE1CG1.jpg \n", "3 701 P04CE1CG1.jpg \n", "4 701 P05CE1CG1.jpg " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products = IdaDataFrame(idadb, \"GOSALES.PRODUCT\")\n", "products.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different products do we have?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distinct products: 274\n" ] } ], "source": [ "n = products['PRODUCT_NUMBER'].count_distinct()\n", "print(\"Distinct products: %s\" %n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only common column of ORDER_DETAILS and PRODUCT is `PRODUCT_NUMBER`.\n", "\n", "The only common column of PRODUCT and PRODUCT_TYPE is `PRODUCT_TYPE_CODE`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PRODUCT_TYPE_CODEPRODUCT_LINE_CODEPRODUCT_TYPE_ENPRODUCT_TYPE_ARPRODUCT_TYPE_CSPRODUCT_TYPE_DAPRODUCT_TYPE_DEPRODUCT_TYPE_ELPRODUCT_TYPE_ESPRODUCT_TYPE_FI...PRODUCT_TYPE_PLPRODUCT_TYPE_PTPRODUCT_TYPE_ROPRODUCT_TYPE_RUPRODUCT_TYPE_SCPRODUCT_TYPE_SLPRODUCT_TYPE_SVPRODUCT_TYPE_TCPRODUCT_TYPE_THPRODUCT_TYPE_TR
0951991Cooking Gearمعدات الطهىPotřeby pro vařeníKogegrejKochzubehörΕξοπλισμός μαγειρικήςMaterial de cocinaKeittovälineet...Sprzęt kuchennyMaterial cozinhaEchipament pentru gătitКухонные принадлежности炊事用具Oprema za kuhanjeKöksredskap炊具อุปกรณ์ทำอาหารMutfak aletleri
1952991TentsالخيامStanyTelteZelteΣκηνέςTiendasTeltat...NamiotyBarracasCorturiПалатки帐篷ŠotoriTält營帳เต็นท์Çadırlar
2953991Sleeping BagsالمناماتSpacákySoveposerSchlafsäckeΥπνόσακοιSacos de dormirMakuupussit...ŚpiworySaco de dormirSaci de dormitСпальные мешки睡袋Spalne vrečeSovsäckar睡袋ถุงนอนUyku Tulumları
3954991PacksالحقائبBatohyRygsækkeRucksäckeΣακίδιαMochilasReput...PlecakiMochilasPacheteРюкзаки и сумки背包NahrbtnikiRyggsäckar背包เป้Sırt Çantaları
4955991LanternsالفوانيسSvítilnyLygterLampenΛάμπεςLinternasLyhdyt...LatarkiLanternasLanterneФонари手提灯SvetilkeLampor營燈ตะเกียงFenerler
\n", "

5 rows × 31 columns

\n", "
" ], "text/plain": [ " PRODUCT_TYPE_CODE PRODUCT_LINE_CODE PRODUCT_TYPE_EN PRODUCT_TYPE_AR \\\n", "0 951 991 Cooking Gear معدات الطهى \n", "1 952 991 Tents الخيام \n", "2 953 991 Sleeping Bags المنامات \n", "3 954 991 Packs الحقائب \n", "4 955 991 Lanterns الفوانيس \n", "\n", " PRODUCT_TYPE_CS PRODUCT_TYPE_DA PRODUCT_TYPE_DE PRODUCT_TYPE_EL \\\n", "0 Potřeby pro vaření Kogegrej Kochzubehör Εξοπλισμός μαγειρικής \n", "1 Stany Telte Zelte Σκηνές \n", "2 Spacáky Soveposer Schlafsäcke Υπνόσακοι \n", "3 Batohy Rygsække Rucksäcke Σακίδια \n", "4 Svítilny Lygter Lampen Λάμπες \n", "\n", " PRODUCT_TYPE_ES PRODUCT_TYPE_FI ... PRODUCT_TYPE_PL PRODUCT_TYPE_PT \\\n", "0 Material de cocina Keittovälineet ... Sprzęt kuchenny Material cozinha \n", "1 Tiendas Teltat ... Namioty Barracas \n", "2 Sacos de dormir Makuupussit ... Śpiwory Saco de dormir \n", "3 Mochilas Reput ... Plecaki Mochilas \n", "4 Linternas Lyhdyt ... Latarki Lanternas \n", "\n", " PRODUCT_TYPE_RO PRODUCT_TYPE_RU PRODUCT_TYPE_SC \\\n", "0 Echipament pentru gătit Кухонные принадлежности 炊事用具 \n", "1 Corturi Палатки 帐篷 \n", "2 Saci de dormit Спальные мешки 睡袋 \n", "3 Pachete Рюкзаки и сумки 背包 \n", "4 Lanterne Фонари 手提灯 \n", "\n", " PRODUCT_TYPE_SL PRODUCT_TYPE_SV PRODUCT_TYPE_TC PRODUCT_TYPE_TH \\\n", "0 Oprema za kuhanje Köksredskap 炊具 อุปกรณ์ทำอาหาร \n", "1 Šotori Tält 營帳 เต็นท์ \n", "2 Spalne vreče Sovsäckar 睡袋 ถุงนอน \n", "3 Nahrbtniki Ryggsäckar 背包 เป้ \n", "4 Svetilke Lampor 營燈 ตะเกียง \n", "\n", " PRODUCT_TYPE_TR \n", "0 Mutfak aletleri \n", "1 Çadırlar \n", "2 Uyku Tulumları \n", "3 Sırt Çantaları \n", "4 Fenerler \n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "types = IdaDataFrame(idadb, \"GOSALES.PRODUCT_TYPE\")\n", "types.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different product types do we have?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distinct product types: 21\n" ] } ], "source": [ "t1 = types['PRODUCT_TYPE_CODE'].count_distinct()\n", "print('Distinct product types: %s' %t1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PRODUCT_NAME_LOOKUP contains the names of the products in several languages as well as their description." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PRODUCT_NUMBERPRODUCT_LANGUAGEPRODUCT_NAMEPRODUCT_DESCRIPTION
01110ARحقيبة مياه TrailChefحقيبة حفيفة قابلة للطي لحمل السوائل بطريقة سهل...
11110ELΔοχείο υγρών Μαρμίτα ΣεφΕλαφρύ, πτυσσόμενο δοχείο για εύκολη μεταφορά ...
21110DETrailChef WasserbeutelLeichter, zusammenfaltbarer Beutel zum einfach...
31110DASahara VandtaskeSammentrykkelig letvægtstaske til væsker. Bred...
41110CSVak na vodu KuchtíkLehký, skladný vak na tekutiny. Široké hrdlo u...
\n", "
" ], "text/plain": [ " PRODUCT_NUMBER PRODUCT_LANGUAGE PRODUCT_NAME \\\n", "0 1110 AR حقيبة مياه TrailChef \n", "1 1110 EL Δοχείο υγρών Μαρμίτα Σεφ \n", "2 1110 DE TrailChef Wasserbeutel \n", "3 1110 DA Sahara Vandtaske \n", "4 1110 CS Vak na vodu Kuchtík \n", "\n", " PRODUCT_DESCRIPTION \n", "0 حقيبة حفيفة قابلة للطي لحمل السوائل بطريقة سهل... \n", "1 Ελαφρύ, πτυσσόμενο δοχείο για εύκολη μεταφορά ... \n", "2 Leichter, zusammenfaltbarer Beutel zum einfach... \n", "3 Sammentrykkelig letvægtstaske til væsker. Bred... \n", "4 Lehký, skladný vak na tekutiny. Široké hrdlo u... " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names = IdaDataFrame(idadb, \"GOSALES.PRODUCT_NAME_LOOKUP\")\n", "names.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see that products with different product numbers (IDs) may have the same product name in the same language, for example because there is just a variation of size or color. There are less product names and descriptions than product numbers." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PRODUCT_NUMBERPRODUCT_LANGUAGEPRODUCT_NAMEPRODUCT_DESCRIPTION
264151110ENAstro PilotThis GPS has a high-contrast 86mm color screen...
265151120ENAstro PilotThis GPS has a high-contrast 86mm color screen...
266151130ENAstro PilotThis GPS has a high-contrast 86mm color screen...
267152110ENSky PilotSmall and lightweight, this GPS has a 55mm col...
268153110ENAuto PilotThe Auto Pilot is an easy-to-use navigation sy...
269154110ENKodiakThe Kodiak includes a tough leather strap, pre...
270154120ENKodiakThe Kodiak includes a tough leather strap, pre...
271154130ENKodiakThe Kodiak includes a tough leather strap, pre...
272154140ENKodiakThe Kodiak includes a tough leather strap, pre...
273154150ENKodiakThe Kodiak includes a tough leather strap, pre...
\n", "
" ], "text/plain": [ " PRODUCT_NUMBER PRODUCT_LANGUAGE PRODUCT_NAME \\\n", "264 151110 EN Astro Pilot \n", "265 151120 EN Astro Pilot \n", "266 151130 EN Astro Pilot \n", "267 152110 EN Sky Pilot \n", "268 153110 EN Auto Pilot \n", "269 154110 EN Kodiak \n", "270 154120 EN Kodiak \n", "271 154130 EN Kodiak \n", "272 154140 EN Kodiak \n", "273 154150 EN Kodiak \n", "\n", " PRODUCT_DESCRIPTION \n", "264 This GPS has a high-contrast 86mm color screen... \n", "265 This GPS has a high-contrast 86mm color screen... \n", "266 This GPS has a high-contrast 86mm color screen... \n", "267 Small and lightweight, this GPS has a 55mm col... \n", "268 The Auto Pilot is an easy-to-use navigation sy... \n", "269 The Kodiak includes a tough leather strap, pre... \n", "270 The Kodiak includes a tough leather strap, pre... \n", "271 The Kodiak includes a tough leather strap, pre... \n", "272 The Kodiak includes a tough leather strap, pre... \n", "273 The Kodiak includes a tough leather strap, pre... " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names[names[\"PRODUCT_LANGUAGE\"]==\"EN\"].tail(10)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PRODUCT_NUMBER 274\n", "PRODUCT_LANGUAGE 1\n", "PRODUCT_NAME 144\n", "PRODUCT_DESCRIPTION 141\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names[names[\"PRODUCT_LANGUAGE\"]==\"EN\"].count_distinct()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* __Create a view__\n", "\n", "\n", "As you can see from the results of the last statements, the value of PRODUCT_NAME does not always identify a single product. In addition it is sometimes not clear from name of product to which type of product it belongs. So let’s create the view PRODUCT_NAME_TYPE_LOOKUP using the GOALES.PRODUCT, GOSALES.PRODUCT_NAME_LOOKUP and GOSALES.PRODUCT_TYPE table where the concatenation of PRODUCT_NAME, PRODUCT_NUMBER and PRODUCT_TYPE_NAME can be used as a unique and understandable name for a product. \n", "\n", "This new view will be useful to get the full English name of a product in order to interpret the rules." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Create view PRODUCT_NAME_TYPE_LOOKUP\n", "\n", "idadb.ida_query(\"create or replace view PRODUCT_NAME_TYPE_LOOKUP as select p.PRODUCT_NUMBER, PRODUCT_NAME, PRODUCT_TYPE_EN, PRODUCT_NAME || '-' || n.PRODUCT_NUMBER || '-' || PRODUCT_TYPE_EN as PRODUCT_NAME_TYPE from GOSALES.PRODUCT p, GOSALES.PRODUCT_NAME_LOOKUP n, GOSALES.PRODUCT_TYPE t where p.PRODUCT_NUMBER = n.PRODUCT_NUMBER and p.PRODUCT_TYPE_CODE = t.PRODUCT_TYPE_CODE and n.PRODUCT_LANGUAGE='EN';\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at it. We define an IdaDataFrame pointing at this view, with `PRODUCT_NUMBER` as indexer column." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PRODUCT_NUMBERPRODUCT_NAMEPRODUCT_TYPE_ENPRODUCT_NAME_TYPE
01110TrailChef Water BagCooking GearTrailChef Water Bag-1110-Cooking Gear
12110TrailChef CanteenCooking GearTrailChef Canteen-2110-Cooking Gear
23110TrailChef Kitchen KitCooking GearTrailChef Kitchen Kit-3110-Cooking Gear
34110TrailChef CupCooking GearTrailChef Cup-4110-Cooking Gear
45110TrailChef Cook SetCooking GearTrailChef Cook Set-5110-Cooking Gear
\n", "
" ], "text/plain": [ " PRODUCT_NUMBER PRODUCT_NAME PRODUCT_TYPE_EN \\\n", "0 1110 TrailChef Water Bag Cooking Gear \n", "1 2110 TrailChef Canteen Cooking Gear \n", "2 3110 TrailChef Kitchen Kit Cooking Gear \n", "3 4110 TrailChef Cup Cooking Gear \n", "4 5110 TrailChef Cook Set Cooking Gear \n", "\n", " PRODUCT_NAME_TYPE \n", "0 TrailChef Water Bag-1110-Cooking Gear \n", "1 TrailChef Canteen-2110-Cooking Gear \n", "2 TrailChef Kitchen Kit-3110-Cooking Gear \n", "3 TrailChef Cup-4110-Cooking Gear \n", "4 TrailChef Cook Set-5110-Cooking Gear " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define IdaDataFrame\n", "idadf = IdaDataFrame(idadb, 'PRODUCT_NAME_TYPE_LOOKUP', indexer = 'PRODUCT_NUMBER')\n", "idadf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use this view for association rules interpretation. Instead of displaying rules as a complicated list of IDs, we want to be able to read them directly as \"product A goes with product B\", in plain language. \n", "\n", "But as seen before, products with distinct IDs may have the same product name. So in order to be able to differentiate them we concatenate the name and the ID. Additionally we concatenate the product type to have a better understanding of the product at first glance. This is what you see in the `PRODUCT_NAME_TYPE` column." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distinct products: 274\n", "Distinct product names: 274\n" ] } ], "source": [ "#SANITY CHECK: we should obtainthe same number of distinct items\n", "n1 = idadf['PRODUCT_NUMBER'].count_distinct()\n", "n2 = idadf['PRODUCT_NAME_TYPE'].count_distinct()\n", "\n", "print(\"Distinct products: %s\" %n1)\n", "print(\"Distinct product names: %s\" %n2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right, we now have a good overview of our datasets. Let's mine association rules!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Mine Association Rules with ibmdbpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reminder about Association Rules Mining\n", "\n", "\n", "__General task__\n", "\n", "A collection of instances, here the content of shopping baskets, is our input. We want to output rules to predict the values of any attribute from the values of other attributes: e.g. knowing some items of a shopping basket, what other products are likely to be purchased?\n", "\n", "* Definitions\n", "\n", "**Support of an itemset in a transaction list**: fraction of transactions of the dataset containing this itemset.\n", "> support(itemset) = number of transactions containing the itemset / total number of transactions in the dataset\n", "\n", "For example, if your transaction dataset is {{bread, peanut butter}, {milk, bread, eggs, bacon}, {milk, bread, butter}}, then the support of the itemset {bread, milk} is 2/3.\n", "\n", "**Frequent itemset**: an itemset whose support in your transaction list is greater than some threshold t.\n", "\n", "**Rule R**: an implication of type \"X--> Y\". For example: {peanut butter} --> {bread}, {milk, bread, eggs} --> {bacon, butter}, {beer, crips} --> {pizza}\n", "\n", "Note: \"-->\" means co-occurrence... not causality!\n", "\n", "**Rule head** and **rule body**: The principal parts of an association rule are the rule body (also referred to as antecedent) and the rule head (also referred to as consequent). For example, where X --> Y is an association rule: The item set X is the rule body. The item Y is the rule head.\n", "\n", "* Evaluation of association rules\n", "\n", "Take the association rule R: \"X-->Y\" where X and Y are two itemsets.\n", "Then the support of the rule R is defined by the fraction of transactions containing all the elements from both X and Y.\n", "> support(\"X-->Y\") = support(X ∪ Y)\n", "\n", "The confidence of the rule is the fraction of itemsets containing X and Y relatively to the number of itemsets containing X. \n", "> confidence(\"X-->Y\") = support(\"X-->Y\") / support(X) = support(X ∪ Y) / support(X)\n", "\n", "The lift of a rule can give a better assessment than its confidence because it considers both the confidence of the rule and the distribution of Y, hence the overall data set. It can be seen as the rule confidence divided by the support of the rule head Y.\n", "> lift (\"X-->Y\") = confidence(\"X-->Y\") / support(Y)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Put your knowledge into practice\n", "\n", "We use ibmdbpy library dedicated to machine learning. You can visit the documentation [here](https://pythonhosted.org/ibmdbpy/association_rules.html)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import ibmdbpy.learn.association_rules as assoc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Define an AssociationRules object__\n", "\n", "It is ready to be used for fitting and prediction:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "Assoc = assoc.AssociationRules(modelname=\"SHOPPING_RULES\", minsupport=0.04, maxlen=4, maxheadlen=1, minconf=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The AssociationRules class provides an interface for using the `ASSOCRULES` and `PREDICT_ASSOCRULES` IDAX methods of Db2.\n", "\n", "* **minsupport**: The minimum fraction (0.0 - 1.0) or the minimum number (above 1) of transactions that must contain a pattern to be considered as frequent. By default, it ranges between >0.0 and <1.0 for a minimum fraction, or above 1 for a minimum number of transactions.\n", "\n", "* **maxlen**: The maximum length of a pattern or a rule, that is, the maximum number of items per pattern or rule. Default is 5.\n", "\n", "* **maxheadlen**: The maximum length of a rule head, that is, the maximum number of items that might belong to the item set on the right side of a rule. Increasing this value might significantly increase the number of detected rules.\n", "\n", "* **minconf** : float, optional. It is the minimum confidence that a rule must achieve to be kept in the model of the pattern. >=0.0 and <= 1, default is 0.5. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'modelname': 'SHOPPING_RULES',\n", " 'minsupport': 0.04,\n", " 'maxlen': 4,\n", " 'maxheadlen': 1,\n", " 'minconf': 0.2,\n", " 'nametable': None,\n", " 'namecol': None,\n", " 'outtable': None,\n", " 'type': None,\n", " 'limit': None,\n", " 'sort': None}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Inspect the object\n", "Assoc.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that many parameters, such as `nametable` and `namecol` have not been set yet. They can be set at fitting step. `nametable` is a table containing the explicit names of all products, with their IDs. We specify `namecol` to explicitly identify, in this table, the column containing the names." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "assocpatterns\n", " ITEMSETID ITEMID\n", "0 1 1\n", "1 2 2\n", "2 3 3\n", "3 4 4\n", "4 5 5\n", "5 6 6\n", "6 7 7\n", "7 8 8\n", "8 9 9\n", "9 10 10\n", "10 11 11\n", "11 12 12\n", "12 13 13\n", "13 14 14\n", "14 15 15\n", "15 16 16\n", "16 17 17\n", "17 18 18\n", "18 19 19\n", "19 20 20\n", "20 21 21\n", "21 22 22\n", "22 23 23\n", "23 24 24\n", "24 25 25\n", "25 26 26\n", "26 27 27\n", "27 28 28\n", "28 29 29\n", "29 30 30\n", ".. ... ...\n", "46 47 47\n", "47 48 48\n", "48 49 49\n", "49 50 50\n", "50 51 51\n", "51 52 52\n", "52 53 53\n", "53 54 54\n", "54 57 5\n", "55 62 19\n", "56 64 38\n", "57 65 38\n", "58 55 4\n", "59 56 5\n", "60 58 10\n", "61 59 12\n", "62 60 14\n", "63 61 19\n", "64 63 38\n", "65 55 38\n", "66 56 38\n", "67 57 41\n", "68 58 38\n", "69 59 38\n", "70 60 38\n", "71 61 38\n", "72 62 41\n", "73 63 41\n", "74 64 42\n", "75 65 51\n", "\n", "[76 rows x 2 columns]\n", " \n", "assocpatterns_stats\n", " ITEMSETID LENGTH COUNT SUPPORT LIFT PRUNED\n", "0 54 1 2254 0.042324 1.000000 0\n", "1 57 2 2483 0.046624 2.976649 0\n", "2 17 1 3129 0.058754 1.000000 0\n", "3 23 1 2539 0.047675 1.000000 0\n", "4 60 2 2451 0.046023 2.040719 0\n", "5 19 1 6520 0.122428 1.000000 0\n", "6 9 1 2400 0.045065 1.000000 0\n", "7 15 1 3705 0.069570 1.000000 0\n", "8 30 1 2319 0.043544 1.000000 0\n", "9 35 1 2892 0.054304 1.000000 0\n", "10 63 2 4165 0.078207 2.536326 0\n", "11 56 2 2903 0.054510 2.525859 0\n", "12 51 1 5241 0.098411 1.000000 0\n", "13 50 1 3028 0.056857 1.000000 0\n", "14 46 1 4073 0.076480 1.000000 0\n", "15 5 1 5576 0.104702 1.000000 0\n", "16 10 1 5679 0.106636 1.000000 0\n", "17 34 1 3745 0.070321 1.000000 0\n", "18 41 1 7967 0.149598 1.000000 0\n", "19 4 1 5678 0.106617 1.000000 0\n", "20 28 1 2696 0.050623 1.000000 0\n", "21 14 1 5827 0.109415 1.000000 0\n", "22 18 1 3989 0.074902 1.000000 0\n", "23 61 2 3246 0.060951 2.415382 0\n", "24 26 1 4173 0.078357 1.000000 0\n", "25 7 1 2437 0.045760 1.000000 0\n", "26 29 1 4445 0.083465 1.000000 0\n", "27 53 1 2912 0.054679 1.000000 0\n", "28 21 1 4238 0.079578 1.000000 0\n", "29 38 1 10977 0.206118 1.000000 0\n", ".. ... ... ... ... ... ...\n", "35 62 2 2349 0.044108 2.408292 0\n", "36 42 1 5102 0.095801 1.000000 0\n", "37 45 1 3958 0.074320 1.000000 0\n", "38 32 1 3777 0.070922 1.000000 0\n", "39 24 1 3027 0.056839 1.000000 0\n", "40 2 1 2447 0.045948 1.000000 0\n", "41 37 1 2686 0.050436 1.000000 0\n", "42 40 1 2230 0.041873 1.000000 0\n", "43 22 1 3543 0.066528 1.000000 0\n", "44 27 1 2343 0.043995 1.000000 0\n", "45 1 1 2511 0.047150 1.000000 0\n", "46 59 2 2164 0.040634 1.948925 0\n", "47 52 1 4288 0.080517 1.000000 0\n", "48 8 1 4527 0.085005 1.000000 0\n", "49 48 1 4131 0.077569 1.000000 0\n", "50 55 2 2497 0.046887 2.133576 0\n", "51 65 2 2341 0.043957 2.167066 0\n", "52 6 1 3763 0.070659 1.000000 0\n", "53 43 1 4247 0.079747 1.000000 0\n", "54 25 1 4201 0.078883 1.000000 0\n", "55 11 1 3795 0.071260 1.000000 0\n", "56 20 1 4273 0.080235 1.000000 0\n", "57 47 1 4543 0.085305 1.000000 0\n", "58 12 1 5387 0.101153 1.000000 0\n", "59 16 1 3926 0.073719 1.000000 0\n", "60 44 1 4459 0.083728 1.000000 0\n", "61 13 1 2677 0.050267 1.000000 0\n", "62 58 2 2492 0.046793 2.128928 0\n", "63 36 1 2877 0.054022 1.000000 0\n", "64 3 1 2670 0.050135 1.000000 0\n", "\n", "[65 rows x 6 columns]\n", " \n", "assocrules\n", " RULEID ITEMSETID BODYID HEADID CONFIDENCE PRUNED\n", "0 1 55 4 38 0.439768 0\n", "1 2 55 38 4 0.227476 0\n", "2 3 56 5 38 0.520624 0\n", "3 4 56 38 5 0.264462 0\n", "4 5 57 5 41 0.445301 0\n", "5 6 57 41 5 0.311661 0\n", "6 7 58 10 38 0.438810 0\n", "7 8 58 38 10 0.227020 0\n", "8 9 59 12 38 0.401708 0\n", "9 10 60 14 38 0.420628 0\n", "10 11 60 38 14 0.223285 0\n", "11 12 61 19 38 0.497853 0\n", "12 13 61 38 19 0.295709 0\n", "13 14 62 19 41 0.360276 0\n", "14 15 62 41 19 0.294841 0\n", "15 16 63 38 41 0.379430 0\n", "16 17 63 41 38 0.522781 0\n", "17 18 64 38 42 0.210713 0\n", "18 19 64 42 38 0.453352 0\n", "19 20 65 38 51 0.213264 0\n", "20 21 65 51 38 0.446670 0\n", " \n", "items\n", " ITEMID ITEM ITEMNAME COUNT SUPPORT\n", "0 1 122140 Bella-122140-Eyewear 2511 0.047150\n", "1 2 124110 Cat Eye-124110-Eyewear 2447 0.045948\n", "2 3 124140 Cat Eye-124140-Eyewear 2670 0.050135\n", "3 4 124190 Cat Eye-124190-Eyewear 5678 0.106617\n", "4 5 125110 Venue-125110-Watches 5576 0.104702\n", "5 6 125120 Venue-125120-Watches 3763 0.070659\n", "6 7 125150 Venue-125150-Watches 2437 0.045760\n", "7 8 126110 Dante-126110-Eyewear 4527 0.085005\n", "8 9 126130 Dante-126130-Eyewear 2400 0.045065\n", "9 10 126140 Dante-126140-Eyewear 5679 0.106636\n", "10 11 127110 Fairway-127110-Eyewear 3795 0.071260\n", "11 12 127130 Fairway-127130-Eyewear 5387 0.101153\n", "12 13 128130 Inferno-128130-Eyewear 2677 0.050267\n", "13 14 128140 Inferno-128140-Eyewear 5827 0.109415\n", "14 15 128150 Inferno-128150-Eyewear 3705 0.069570\n", "15 16 128200 Inferno-128200-Eyewear 3926 0.073719\n", "16 17 128210 Inferno-128210-Eyewear 3129 0.058754\n", "17 18 129110 Infinity-129110-Watches 3989 0.074902\n", "18 19 129130 Infinity-129130-Watches 6520 0.122428\n", "19 20 129150 Infinity-129150-Watches 4273 0.080235\n", "20 21 129180 Infinity-129180-Watches 4238 0.079578\n", "21 22 130110 Lux-130110-Watches 3543 0.066528\n", "22 23 130130 Lux-130130-Watches 2539 0.047675\n", "23 24 131120 Max Gizmo-131120-Knives 3027 0.056839\n", "24 25 132110 Maximus-132110-Eyewear 4201 0.078883\n", "25 26 132120 Maximus-132120-Eyewear 4173 0.078357\n", "26 27 132140 Maximus-132140-Eyewear 2343 0.043995\n", "27 28 132150 Maximus-132150-Eyewear 2696 0.050623\n", "28 29 132170 Maximus-132170-Eyewear 4445 0.083465\n", "29 30 134120 Pocket Gizmo-134120-Knives 2319 0.043544\n", "30 31 135110 Ranger Vision-135110-Binoculars 2540 0.047694\n", "31 32 135120 Ranger Vision-135120-Binoculars 3777 0.070922\n", "32 33 136140 Sam-136140-Watches 4292 0.080592\n", "33 34 143110 Trendi-143110-Eyewear 3745 0.070321\n", "34 35 143120 Trendi-143120-Eyewear 2892 0.054304\n", "35 36 144150 TX-144150-Watches 2877 0.054022\n", "36 37 144170 TX-144170-Watches 2686 0.050436\n", "37 38 144180 TX-144180-Watches 10977 0.206118\n", "38 39 145130 Legend-145130-Watches 4081 0.076630\n", "39 40 145160 Legend-145160-Watches 2230 0.041873\n", "40 41 145170 Legend-145170-Watches 7967 0.149598\n", "41 42 147110 Zone-147110-Eyewear 5102 0.095801\n", "42 43 147120 Zone-147120-Eyewear 4247 0.079747\n", "43 44 147130 Zone-147130-Eyewear 4459 0.083728\n", "44 45 147160 Zone-147160-Eyewear 3958 0.074320\n", "45 46 147170 Zone-147170-Eyewear 4073 0.076480\n", "46 47 147180 Zone-147180-Eyewear 4543 0.085305\n", "47 48 148110 Hawk Eye-148110-Eyewear 4131 0.077569\n", "48 49 148120 Hawk Eye-148120-Eyewear 3440 0.064594\n", "49 50 148130 Hawk Eye-148130-Eyewear 3028 0.056857\n", "50 51 149140 Retro-149140-Eyewear 5241 0.098411\n", "51 52 151110 Astro Pilot-151110-Navigation 4288 0.080517\n", "52 53 152110 Sky Pilot-152110-Navigation 2912 0.054679\n", "53 54 154150 Kodiak-154150-Watches 2254 0.042324\n", " \n" ] } ], "source": [ "Assoc.fit(orders, transaction_id=\"ORDER_NUMBER\", item_id=\"PRODUCT_NUMBER\", nametable = \"PRODUCT_NAME_TYPE_LOOKUP\", namecol='PRODUCT_NAME_TYPE', verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Understand the intermediary tables**\n", "\n", "In the cell above, intermediate tables have been printed because we have set `verbose` to True. You can also obtain them with `AssociationRules.describe(detail=True)`. Here is an explanation of these tables.\n", "\n", "The first table **assocpatterns** shows which itemset (`ITEMSETID` column) contains which items (`ITEMID` column). Only the frequent itemsets are shown i.e. itemsets that are eligible according to the constraints you have set on your AssociationRules object. \n", "\n", "For example, the table above tells you that 65 itemsets are considered to be frequent. \n", "\n", " 1:{1}, 2:{2}, 3:{3}, ... 55:{4,38},..., 59:{12,38},..., 62:{19,41},..., 64:{42,38}, 65:{38, 51} \n", " \n", "Frequent itemset with itemsetid 1 contains product with itemid 1, frequent itemset with itemsetid 2 contains product with itemid 2, ..., frequent itemset with itemsetid 55 contains products with itemid 4 and 38, ... frequent itemset with itemsetid 65 contains products with itemid 38 and 51.\n", " \n", "Note that the item IDs are the IDs of the items in table \"item\". They are not the original product IDs from the product table.\n", "\n", "The second table **assocpatterns_stats** provides for each itemset some metrics: its cardinal (`LENGTH`), its absolute frequency in the dataset (`COUNT`), its support and its lift, as well as whether it was \"pruned\".\n", "\n", "The third table **assocrules** gives you the rules that have been found, according to the constraints you have set. Here we have 21 rules. \n", "\n", " * Rule 1: itemset 4 --> itemset 38 id est {4} --> {38}, corresponds to itemset 55: {4,38};\n", " * Rule 3: itemset 5 --> itemset 38 id est {5} --> {38}, corresponds to itemset 56: {5,38};\n", " * Rule 18: itemset 38 --> itemset 42 id est {38} --> {42}, corresponds to itemset 64: {38,42};\n", " * Rule 21: itemset 51 --> itemset 38 id est {51} --> {38}, corresponds to itemset 65: {38,51}.\n", " \n", "The **items** table provides the mapping between new item IDs (`ITEMID`) and original item IDs (`ITEM`). It uses the `PRODUCT_NAME_TYPE_LOOKUP`table for interpretation and also gives support and count for each item." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Analyze the results__\n", "\n", "Printing all these intermediary tables with new IDs can be confusing... The `describe()` method is there to provide you with an overview of the rules, with product names for better understanding." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary of the rules\n", " ID RULE SUPPORT LIFT \\\n", "0 17 (Legend-145170-Watches) => (TX-144180-Watches) 0.078207 2.536326 \n", "1 16 (TX-144180-Watches) => (Legend-145170-Watches) 0.078207 2.536326 \n", "2 12 (Infinity-129130-Watches) => (TX-144180-Watches) 0.060951 2.415382 \n", "3 13 (TX-144180-Watches) => (Infinity-129130-Watches) 0.060951 2.415382 \n", "4 3 (Venue-125110-Watches) => (TX-144180-Watches) 0.054510 2.525859 \n", "5 4 (TX-144180-Watches) => (Venue-125110-Watches) 0.054510 2.525859 \n", "6 1 (Cat Eye-124190-Eyewear) => (TX-144180-Watches) 0.046887 2.133576 \n", "7 2 (TX-144180-Watches) => (Cat Eye-124190-Eyewear) 0.046887 2.133576 \n", "8 7 (Dante-126140-Eyewear) => (TX-144180-Watches) 0.046793 2.128928 \n", "9 8 (TX-144180-Watches) => (Dante-126140-Eyewear) 0.046793 2.128928 \n", "10 5 (Venue-125110-Watches) => (Legend-145170-Watches) 0.046624 2.976649 \n", "11 6 (Legend-145170-Watches) => (Venue-125110-Watches) 0.046624 2.976649 \n", "12 10 (Inferno-128140-Eyewear) => (TX-144180-Watches) 0.046023 2.040719 \n", "13 11 (TX-144180-Watches) => (Inferno-128140-Eyewear) 0.046023 2.040719 \n", "14 14 (Infinity-129130-Watches) => (Legend-145170-Wa... 0.044108 2.408292 \n", "15 15 (Legend-145170-Watches) => (Infinity-129130-Wa... 0.044108 2.408292 \n", "16 21 (Retro-149140-Eyewear) => (TX-144180-Watches) 0.043957 2.167066 \n", "17 20 (TX-144180-Watches) => (Retro-149140-Eyewear) 0.043957 2.167066 \n", "18 19 (Zone-147110-Eyewear) => (TX-144180-Watches) 0.043432 2.199480 \n", "19 18 (TX-144180-Watches) => (Zone-147110-Eyewear) 0.043432 2.199480 \n", "20 9 (Fairway-127130-Eyewear) => (TX-144180-Watches) 0.040634 1.948925 \n", "\n", " CONFIDENCE \n", "0 0.522781 \n", "1 0.379430 \n", "2 0.497853 \n", "3 0.295709 \n", "4 0.520624 \n", "5 0.264462 \n", "6 0.439768 \n", "7 0.227476 \n", "8 0.438810 \n", "9 0.227020 \n", "10 0.445301 \n", "11 0.311661 \n", "12 0.420628 \n", "13 0.223285 \n", "14 0.360276 \n", "15 0.294841 \n", "16 0.446670 \n", "17 0.213264 \n", "18 0.453352 \n", "19 0.210713 \n", "20 0.401708 \n" ] } ], "source": [ "# You can print the tables again with the describe method:\n", "Assoc.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Comment\n", "\n", "It seems like in this outdoor store, clients often buy several watches or several pair of glasses at a time and also eyewear and watches together. For example if you buy a \"Legend\" watch you are likely to buy a \"Venue\" watch too. Same for \"Fairway\" eyewear and \"TX\" watch, \"Retro\" eyewear and \"TX\" watch etc. \n", "\n", "The best rule links the items \"TX-144180-Watches\" (support = 0.206118) and \"Legend-145170-Watches\" (support = 0.149598) and has the highest lift as well as the highest confidence. When a customer buys a Legend watch he is twice more likely to buy a TX watch as the others.\n", "\n", "\"Max Gizmo\" knives are \"frequent\" because their support was above the threshold but this product does not appear in any of the rules above because no rule using this product could satisfy the confidence constraint we defined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Confidence interpretation\n", "\n", "Don't be mislead by high confidence values! A rule with higher confidence could seem more accurate, because better supported. However we have to assess the accuracy of a rule relatively to the dataset. So a high confidence does not necessarily mean that there is a high level of implication in the rule. Sometimes it's just that Y is frequent (or much more frequent than X), so that support(X ∪ Y) / support(X) is high. That is why we use the lift metric.\n", "\n", "\n", "* Lift interpretation \n", "\n", "Remember that the value of lift gives a better assessment because it considers both the confidence of the rule and the distribution of Y, hence the overall data set. \n", "\n", "If the lift is **equal to 1**, then the probability of occurrence of the antecedent X and that of the consequent Y are independent of each other. This means that **no rule** can be drawn involving these two events.\n", "If the lift is **above 1**, then it tells us to which degree these two occurrences are **dependent on one another**, and we can try to use these rules to make prediction on future datasets.\n", "If the lift is **under 1**, then the items are **substitute to each other**. This means that presence of one item has a negative effect on presence of the other item and vice versa. If you buy product X you are less likely to buy product Y and vice versa." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Experiment for yourself with pruning__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pruning enables you to remove rules and pattern which you are not interested in. Here we proceed by filtering the rules that have been selected above." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary of the rules\n", " ID RULE SUPPORT LIFT \\\n", "0 17 (Legend-145170-Watches) => (TX-144180-Watches) 0.078207 2.536326 \n", "1 16 (TX-144180-Watches) => (Legend-145170-Watches) 0.078207 2.536326 \n", "2 12 (Infinity-129130-Watches) => (TX-144180-Watches) 0.060951 2.415382 \n", "3 13 (TX-144180-Watches) => (Infinity-129130-Watches) 0.060951 2.415382 \n", "\n", " CONFIDENCE \n", "0 0.522781 \n", "1 0.379430 \n", "2 0.497853 \n", "3 0.295709 \n" ] } ], "source": [ "Assoc.prune(itemsin=None, itemsout=None, minlen=1, maxlen=None, \n", " minsupport=0.06, maxsupport=1, minlift=None, maxlift=None, \n", " minconf=None, maxconf=None, reset=True)\n", "\n", "Assoc.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: when printing the intermediary table by setting the option `detail=True` in `describe()`, rules and patterns that have been removed from the model are tagged with a 1 in column `PRUNED`. As shown in the summary table above, only 4 rules remain with our current filtering constraints. Also note the symmetry of the lift and support metrics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "__Close the connection to Db2__\n", "\n", "Congratulations! You reached the end of this notebook. Before closing the notebook, close the connection to Db2 with a simple line of code." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connection closed.\n" ] } ], "source": [ "# Close the connection to IBM Db2 Warehouse\n", "idadb.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Congratulations!__ You are ready to perform association rules mining with ibmdbpy!\n", "\n", "\n", "## Where to go from here ?\n", "\n", "More examples of ibmdbpy capabilities and ML algorithms are available on this GitHub repository.\n", "\n", "* Getting started with ibmdbpy :\n", " \n", " [Basics](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb)\n", "\n", " [GeoBasics](../GettingStarted/ibmdbpy_GettingStarted_2-geo_basics.ipynb)\n", "\n", " [Extensive Guide](../GettingStarted/ibmdbpy_GettingStarted_3-geo_guide.ipynb)\n", " \n", "\n", "* More practice : analyze the Museums dataset, understand how to create IdaDataFrames and IdaGeoDataFrames:\n", " \n", " [Preprocessing](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_1-preprocessing.ipynb)\n", "\n", " [Geospatial recommendation](../MuseumsUseCase/ibmdbpy_Museums_DataAnalysis_2-geospatial.ipynb)\n", "\n", "\n", "* Machine learning with ibmdbpy: \n", " \n", " [Naïve Bayes](../MachineLearning/ibmdbpy_NaiveBayes.ipynb)\n", "\n", " [KMeans](../GettingStarted/ibmdbpy_GettingStarted_1-basics.ipynb) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "________\n", "__Author__\n", "\n", "Eva Feillet, ML intern @ IBM Research & Development Lab, Germany. IBM Cloud and Cognitive Software, Böblingen, Germany." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }