{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Association Rule Learning\n", "\n", "This tutorial goes over [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) and the [Apriori algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm). It covers the following:\n", "1. A custom but simple implementation of Apriori which reads its data from a CSV file (easy to use but not efficient)\n", "2. An implementation using Orange\n", "3. An implementation using the R programming language (a very streamlined & fast implementation)\n", "\n", "\n", "## Custom Implementation\n", "\n", "This code is a Python implementation of the Apriori algorithm for finding frequent itemsets and association rules, taken from [this repository](https://github.com/asaini/Apriori). Its advantage is that it does not require much manipulation of the data and can create association rules from a .csv file directly.\n", "\n", "The first part of the code defines the custom methods/functions for Apriori, and the second part sets the support and confidence thresholds and runs the Apriori algorithm.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sys\n", "\n", "from itertools import chain, combinations\n", "from collections import defaultdict\n", "from optparse import OptionParser\n", "\n", "\n", "def subsets(arr):\n", " \"\"\"Returns non-empty subsets of arr\"\"\"\n", " return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])\n", "\n", "\n", "def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):\n", " \"\"\"Calculates the support for items in the itemSet and returns a subset\n", " of the itemSet each of whose elements satisfies the minimum support\"\"\"\n", " _itemSet = set()\n", " localSet = defaultdict(int)\n", "\n", " for item in itemSet:\n", " for transaction in transactionList:\n", " if item.issubset(transaction):\n", " freqSet[item] += 1\n", " localSet[item] += 1\n", "\n", " 
for item, count in localSet.items():\n", " support = float(count)/len(transactionList)\n", "\n", " if support >= minSupport:\n", " _itemSet.add(item)\n", "\n", " return _itemSet\n", "\n", "\n", "def joinSet(itemSet, length):\n", " \"\"\"Join a set with itself and returns the n-element itemsets\"\"\"\n", " return set([i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length])\n", "\n", "\n", "def getItemSetTransactionList(data_iterator):\n", " transactionList = list()\n", " itemSet = set()\n", " for record in data_iterator:\n", " transaction = frozenset(record)\n", " transactionList.append(transaction)\n", " for item in transaction:\n", " itemSet.add(frozenset([item])) # Generate 1-itemSets\n", " return itemSet, transactionList\n", "\n", "\n", "def runApriori(data_iter, minSupport, minConfidence):\n", " \"\"\"\n", " run the apriori algorithm. data_iter is a record iterator\n", " Return both:\n", " - items (tuple, support)\n", " - rules ((pretuple, posttuple), confidence)\n", " \"\"\"\n", " itemSet, transactionList = getItemSetTransactionList(data_iter)\n", "\n", " freqSet = defaultdict(int)\n", " largeSet = dict()\n", " # Global dictionary which stores (key=n-itemSets,value=support)\n", " # which satisfy minSupport\n", "\n", " assocRules = dict()\n", " # Dictionary which stores Association Rules\n", "\n", " oneCSet = returnItemsWithMinSupport(itemSet,\n", " transactionList,\n", " minSupport,\n", " freqSet)\n", "\n", " currentLSet = oneCSet\n", " k = 2\n", " while(currentLSet != set([])):\n", " largeSet[k-1] = currentLSet\n", " currentLSet = joinSet(currentLSet, k)\n", " currentCSet = returnItemsWithMinSupport(currentLSet,\n", " transactionList,\n", " minSupport,\n", " freqSet)\n", " currentLSet = currentCSet\n", " k = k + 1\n", "\n", " def getSupport(item):\n", " \"\"\"local function which Returns the support of an item\"\"\"\n", " return float(freqSet[item])/len(transactionList)\n", "\n", " toRetItems = []\n", " for key, value in 
largeSet.items():\n", " toRetItems.extend([(tuple(item), getSupport(item))\n", " for item in value])\n", "\n", " toRetRules = []\n", " for key, value in largeSet.items()[1:]:\n", " for item in value:\n", " _subsets = map(frozenset, [x for x in subsets(item)])\n", " for element in _subsets:\n", " remain = item.difference(element)\n", " if len(remain) > 0:\n", " confidence = getSupport(item)/getSupport(element)\n", " if confidence >= minConfidence:\n", " toRetRules.append(((tuple(element), tuple(remain)),\n", " confidence))\n", " return toRetItems, toRetRules\n", "\n", "\n", "def printResults(items, rules):\n", " \"\"\"prints the generated itemsets sorted by support and the confidence rules sorted by confidence\"\"\"\n", " for item, support in sorted(items, key=lambda (item, support): support):\n", " print \"item: %s , %.3f\" % (str(item), support)\n", " print \"\\n------------------------ RULES:\"\n", " for rule, confidence in sorted(rules, key=lambda (rule, confidence): confidence):\n", " pre, post = rule\n", " print \"Rule: %s ==> %s , %.3f\" % (str(pre), str(post), confidence)\n", "\n", "\n", "def dataFromFile(fname):\n", " \"\"\"Function which reads from the file and yields a generator\"\"\"\n", " file_iter = open(fname, 'rU')\n", " for line in file_iter:\n", " line = line.strip().rstrip(',') # Remove trailing comma\n", " record = frozenset(line.split(','))\n", " yield record" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "item: ('Brooklyn',) , 0.152\n", "item: ('HISPANIC',) , 0.164\n", "item: ('HISPANIC', 'MBE') , 0.164\n", "item: ('MBE', 'WBE') , 0.169\n", "item: ('MBE', 'New York') , 0.170\n", "item: ('WBE', 'New York') , 0.175\n", "item: ('MBE', 'ASIAN') , 0.200\n", "item: ('ASIAN',) , 0.202\n", "item: ('New York',) , 0.295\n", "item: ('NON-MINORITY',) , 0.300\n", "item: ('NON-MINORITY', 'WBE') , 0.300\n", "item: ('BLACK',) , 0.301\n", "item: ('MBE', 
'BLACK') , 0.301\n", "item: ('WBE',) , 0.477\n", "item: ('MBE',) , 0.671\n", "\n", "------------------------ RULES:\n", "Rule: ('WBE',) ==> ('NON-MINORITY',) , 0.628\n", "Rule: ('ASIAN',) ==> ('MBE',) , 0.990\n", "Rule: ('HISPANIC',) ==> ('MBE',) , 1.000\n", "Rule: ('BLACK',) ==> ('MBE',) , 1.000\n", "Rule: ('NON-MINORITY',) ==> ('WBE',) , 1.000\n" ] } ], "source": [ "inFile = 
dataFromFile('../datasets/INTEGRATED-DATASET.csv')\n", "minSupport = 0.15\n", "minConfidence = 0.6\n", "\n", "items, rules = runApriori(inFile, minSupport, minConfidence)\n", "\n", "printResults(items, rules)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Association Rule Learning Using Orange\n", "\n", "[Orange](http://orange.biolab.si/) is an open-source data visualization and data analysis tool with interactive workflows and a large toolbox. As noted in the tutorial video, Orange also provides a graphical user interface which can be used to quickly prototype rule mining. For more on the GUI you can read the [getting started](http://orange.biolab.si/getting-started/) guide and the [available modules](http://orange.biolab.si/docs/latest/widgets/rst/) reference.\n", "\n", "Since the custom Apriori implementation above is not as streamlined as a library, [Orange's implementation of Apriori](http://orange.biolab.si/docs/latest/reference/rst/Orange.associate.html) can be used instead. The following code is based upon example datasets provided by Orange [here](http://orange.biolab.si/docs/latest/reference/rst/Orange.associate.html).\n", "\n", "In addition, Orange's add-on for enumerating frequent itemsets and mining association rules is also very powerful. The [documentation here](http://orange3-associate.readthedocs.org/en/latest/) outlines how to use custom data sets."
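, "\n", "As a rough sketch of the two quantities every implementation in this notebook reports, support and confidence can be computed directly in plain Python. The toy baskets below are invented purely for illustration; they are not one of the datasets used in this notebook:\n", "\n", "```python\n", "def support(itemset, transactions):\n", "    # fraction of transactions that contain every item in itemset\n", "    return sum(itemset <= t for t in transactions) / float(len(transactions))\n", "\n", "def confidence(antecedent, consequent, transactions):\n", "    # support of the whole rule divided by support of its left-hand side\n", "    return support(antecedent | consequent, transactions) / support(antecedent, transactions)\n", "\n", "baskets = [frozenset(b) for b in ([\"Cola\", \"Diapers\", \"Milk\"],\n", "                                  [\"Cola\", \"Diapers\"],\n", "                                  [\"Beer\", \"Diapers\"],\n", "                                  [\"Beer\", \"Milk\"],\n", "                                  [\"Beer\"])]\n", "print(support(frozenset([\"Cola\"]), baskets))\n", "print(confidence(frozenset([\"Cola\"]), frozenset([\"Diapers\"]), baskets))\n", "```\n", "\n", "A rule is kept only when both values clear the chosen thresholds, which is exactly what `minSupport` and `minConfidence` control in the custom code above and what `support` controls in the Orange calls below."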
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Supp Conf Rule\n", " 0.4 1.0 Cola -> Diapers\n", " 0.4 0.5 Diapers -> Cola\n", " 0.4 1.0 Cola -> Diapers Milk\n", " 0.4 1.0 Cola Diapers -> Milk\n", " 0.4 1.0 Cola Milk -> Diapers\n" ] } ], "source": [ "import Orange\n", "data = Orange.data.Table(\"market-basket.basket\")\n", "\n", "rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)\n", "print \"%4s %4s %s\" % (\"Supp\", \"Conf\", \"Rule\")\n", "for r in rules[:5]:\n", " print \"%4.1f %4.1f %s\" % (r.support, r.confidence, r)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0.40) Cola\n", "(0.40) Cola Diapers\n", "(0.40) Cola Diapers Milk\n", "(0.40) Cola Milk\n", "(0.60) Beer\n" ] } ], "source": [ "import Orange\n", "data = Orange.data.Table(\"market-basket.basket\")\n", "\n", "ind = Orange.associate.AssociationRulesSparseInducer(support=0.4, storeExamples = True)\n", "itemsets = ind.get_itemsets(data)\n", "for itemset, tids in itemsets[:5]:\n", " print \"(%4.2f) %s\" % (len(tids)/float(len(data)),\n", " \" \".join(data.domain[item].name for item in itemset))\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " supp conf\n", "0.500 1.000 fear -> surprise\n", "0.500 1.000 surprise -> fear\n", "0.500 1.000 fear -> surprise our\n", "0.500 1.000 fear surprise -> our\n", "0.500 1.000 fear our -> surprise\n", "0.500 1.000 surprise -> fear our\n", "0.500 1.000 surprise our -> fear\n", "0.500 0.714 our -> fear surprise\n", "0.500 1.000 fear -> our\n", "0.500 0.714 our -> fear\n", "0.500 1.000 surprise -> our\n", "0.500 0.714 our -> surprise\n" ] } ], "source": [ "import Orange\n", "data = Orange.data.Table(\"inquisition.basket\")\n", "\n", "rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)\n", "\n", "print \"%5s %5s\" % (\"supp\", \"conf\")\n", "for r in rules:\n", " print \"%5.3f %5.3f %s\" % (r.support, r.confidence, r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Association Rule Learning Using R\n", "\n", "My favourite implementation of Apriori is in the [R](https://www.r-project.org/) programming language. 
This example runs Apriori on the Adult data set with the support and confidence thresholds set explicitly, restricts rule right-hand sides to \"race=White\", and keeps the five rules with the highest lift.\n", "\n", " library(arules)\n", " library(printr)\n", " data(\"Adult\")\n", "\n", " rules <- apriori(Adult,\n", " parameter = list(support = 0.4, confidence = 0.7),\n", " appearance = list(rhs = c(\"race=White\"), default = \"lhs\"))\n", " rules.sorted <- sort(rules, by = \"lift\")\n", " top5.rules <- head(rules.sorted, 5)\n", " as(top5.rules, \"data.frame\")\n", "\n", "You can read more about the implementation here:\n", "\n", "https://cran.r-project.org/web/packages/arules/arules.pdf\n", "\n", "https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }