{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "The Apriori Algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. The algorithm identifies the frequent individual items in the database and, as long as those itemsets appear sufficiently often in the database, extends them to larger itemsets. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# (c) 2014 Everaldo Aguiar & Reid Johnson\n", "#\n", "# Modified from:\n", "# Marcel Caraciolo (https://gist.github.com/marcelcaraciolo/1423287)\n", "#\n", "# Functions to compute and extract association rules from a given frequent \n", "# itemset generated by the Apriori algorithm.\n", "\n", "def apriori(dataset, min_support=0.5, VERBOSE=False):\n", " \"\"\"Implements the Apriori algorithm.\n", "\n", " The Apriori algorithm will iteratively generate new candidate \n", " k-itemsets using the frequent (k-1)-itemsets found in the previous \n", " iteration.\n", "\n", " Parameters\n", " ----------\n", " dataset : list\n", " The dataset (a list of transactions) from which to generate \n", " candidate itemsets.\n", "\n", " min_support : float\n", " The minimum support threshold. Defaults to 0.5.\n", "\n", " Returns\n", " -------\n", " F : list\n", " The list of frequent itemsets.\n", "\n", " support_data : dict\n", " The support data for all candidate itemsets.\n", "\n", " References\n", " ----------\n", " .. [1] R. Agrawal, R. Srikant, \"Fast Algorithms for Mining Association \n", " Rules\", 1994.\n", "\n", " \"\"\"\n", " C1 = create_candidates(dataset)\n", " D = map(set, dataset)\n", " F1, support_data = support_prune(D, C1, min_support, VERBOSE=False) # prune candidate 1-itemsets\n", " F = [F1] # list of frequent itemsets; initialized to frequent 1-itemsets\n", " k = 2 # the itemset cardinality\n", " while (len(F[k - 2]) > 0):\n", " Ck = apriori_gen(F[k-2], k) # generate candidate itemsets\n", " Fk, supK = support_prune(D, Ck, min_support) # prune candidate itemsets\n", " support_data.update(supK) # update the support counts to reflect pruning\n", " F.append(Fk) # add the pruned candidate itemsets to the list of frequent itemsets\n", " k += 1\n", "\n", " if VERBOSE:\n", " # Print a list of all the frequent itemsets.\n", " for kset in F:\n", " for item in kset:\n", " print \"\" \\\n", " + \"{\" \\\n", " + \"\".join(str(i) + \", \" for i in iter(item)).rstrip(', ') \\\n", " + \"}\" \\\n", " + \": sup = \" + str(round(support_data[item], 3))\n", "\n", " return F, support_data\n", "\n", "def create_candidates(dataset, VERBOSE=False):\n", " \"\"\"Creates a list of candidate 1-itemsets from a list of transactions.\n", "\n", " Parameters\n", " ----------\n", " dataset : list\n", " The dataset (a list of transactions) from which to generate candidate \n", " itemsets.\n", "\n", " Returns\n", " -------\n", " The list of candidate itemsets (c1) passed as a frozenset (a set that is \n", " immutable and hashable).\n", " \"\"\"\n", " c1 = [] # list of all items in the database of transactions\n", " for transaction in dataset:\n", " for item in transaction:\n", " if not [item] in c1:\n", " c1.append([item])\n", " c1.sort()\n", "\n", " if VERBOSE:\n", " # Print a list of all the candidate 
{ "cell_type": "code", "collapsed": false, "input": [ "# (c) 2014 Everaldo Aguiar & Reid Johnson\n", "#\n", "# Modified from:\n", "# Marcel Caraciolo (https://gist.github.com/marcelcaraciolo/1423287)\n", "#\n", "# Functions to compute and extract association rules from a given frequent \n", "# itemset generated by the Apriori algorithm.\n", "\n",
"def apriori(dataset, min_support=0.5, VERBOSE=False):\n", "    \"\"\"Implements the Apriori algorithm.\n", "\n", "    The Apriori algorithm will iteratively generate new candidate \n", "    k-itemsets using the frequent (k-1)-itemsets found in the previous \n", "    iteration.\n", "\n", "    Parameters\n", "    ----------\n", "    dataset : list\n", "        The dataset (a list of transactions) from which to generate \n", "        candidate itemsets.\n", "\n", "    min_support : float\n", "        The minimum support threshold. Defaults to 0.5.\n", "\n", "    VERBOSE : bool\n", "        If True, print each frequent itemset and its support. Defaults to \n", "        False.\n", "\n", "    Returns\n", "    -------\n", "    F : list\n", "        The list of frequent itemsets.\n", "\n", "    support_data : dict\n", "        The support data for all candidate itemsets.\n", "\n", "    References\n", "    ----------\n", "    .. [1] R. Agrawal, R. Srikant, \"Fast Algorithms for Mining Association \n", "           Rules\", 1994.\n", "\n", "    \"\"\"\n",
"    C1 = create_candidates(dataset)\n", "    D = map(set, dataset)\n", "    F1, support_data = support_prune(D, C1, min_support, VERBOSE=False) # prune candidate 1-itemsets\n", "    F = [F1] # list of frequent itemsets; initialized to frequent 1-itemsets\n", "    k = 2 # the itemset cardinality\n", "    while (len(F[k - 2]) > 0):\n", "        Ck = apriori_gen(F[k-2], k) # generate candidate itemsets\n", "        Fk, supK = support_prune(D, Ck, min_support) # prune candidate itemsets\n", "        support_data.update(supK) # update the support data with the newly found itemsets\n", "        F.append(Fk) # add the pruned candidate itemsets to the list of frequent itemsets\n", "        k += 1\n", "\n", "    if VERBOSE:\n", "        # Print a list of all the frequent itemsets.\n", "        for kset in F:\n", "            for item in kset:\n", "                print \"\" \\\n", "                    + \"{\" \\\n", "                    + \"\".join(str(i) + \", \" for i in iter(item)).rstrip(', ') \\\n", "                    + \"}\" \\\n", "                    + \": sup = \" + str(round(support_data[item], 3))\n", "\n", "    return F, support_data\n", "\n",
"def create_candidates(dataset, VERBOSE=False):\n", "    \"\"\"Creates a list of candidate 1-itemsets from a list of transactions.\n", "\n", "    Parameters\n", "    ----------\n", "    dataset : list\n", "        The dataset (a list of transactions) from which to generate candidate \n", "        itemsets.\n", "\n", "    Returns\n", "    -------\n", "    The list of candidate 1-itemsets (c1), each mapped to a frozenset (a set \n", "    that is immutable and hashable).\n", "    \"\"\"\n", "    c1 = [] # list of all items in the database of transactions\n", "    for transaction in dataset:\n", "        for item in transaction:\n", "            if not [item] in c1:\n", "                c1.append([item])\n", "    c1.sort()\n", "\n", "    if VERBOSE:\n", "        # Print a list of all the candidate items.\n", "        print \"\" \\\n", "            + \"{\" \\\n", "            + \"\".join(str(i[0]) + \", \" for i in iter(c1)).rstrip(', ') \\\n", "            + \"}\"\n", "\n", "    # Map each itemset in c1 to a frozenset so it can be used as a dictionary key.\n", "    return map(frozenset, c1)\n", "\n",
"def support_prune(dataset, candidates, min_support, VERBOSE=False):\n", "    \"\"\"Returns all candidate itemsets that meet a minimum support threshold.\n", "\n", "    By the apriori principle, if an itemset is frequent, then all of its \n", "    subsets must also be frequent. As a result, we can perform support-based \n", "    pruning to systematically control the exponential growth of candidate \n", "    itemsets. Thus, itemsets that do not meet the minimum support level are \n", "    pruned from the list of candidate itemsets.\n", "\n", "    Parameters\n", "    ----------\n", "    dataset : list\n", "        The dataset (a list of transaction sets) against which the support \n", "        of each candidate itemset is counted.\n", "\n", "    candidates : list\n", "        The list of candidate itemsets (frozensets).\n", "\n", "    min_support : float\n", "        The minimum support threshold.\n", "\n", "    Returns\n", "    -------\n", "    retlist : list\n", "        The list of frequent itemsets.\n", "\n", "    support_data : dict\n", "        The support data for all candidate itemsets.\n", "    \"\"\"\n",
"    sscnt = {} # dict of support counts, keyed by candidate itemset\n", "    for tid in dataset:\n", "        for can in candidates:\n", "            if can.issubset(tid):\n", "                sscnt.setdefault(can, 0)\n", "                sscnt[can] += 1\n", "\n", "    num_items = float(len(dataset)) # total number of transactions in the dataset\n", "    retlist = [] # list of itemsets that survive pruning\n", "    support_data = {} # dict of support values for the corresponding itemsets\n", "    for key in sscnt:\n", "        # Calculate the support of itemset key.\n", "        support = sscnt[key] / num_items\n", "        if support >= min_support:\n", "            retlist.insert(0, key)\n", "        support_data[key] = support\n", "\n", "    # Print a list of the pruned itemsets.\n", "    if VERBOSE:\n", "        for kset in retlist:\n", "            for item in kset:\n", "                print \"{\" + str(item) + \"}\"\n", "        print\n", "        for key in sscnt:\n", "            print \"\" \\\n", "                + \"{\" \\\n", "                + \"\".join([str(i) + \", \" for i in iter(key)]).rstrip(', ') \\\n", "                + \"}\" \\\n", "                + \": sup = \" + str(support_data[key])\n", "\n", "    return retlist, support_data\n", "\n",
"def apriori_gen(freq_sets, k):\n", "    \"\"\"Generates candidate itemsets (via the F_k-1 x F_k-1 method).\n", "\n", "    This operation generates new candidate k-itemsets based on the frequent \n", "    (k-1)-itemsets found in the previous iteration. The candidate generation \n", "    procedure merges a pair of frequent (k-1)-itemsets only if their first \n", "    k-2 items are identical.\n", "\n", "    Parameters\n", "    ----------\n", "    freq_sets : list\n", "        The list of frequent (k-1)-itemsets.\n", "\n", "    k : integer\n", "        The cardinality of the current itemsets being evaluated.\n", "\n", "    Returns\n", "    -------\n", "    retList : list\n", "        The list of merged frequent itemsets.\n", "    \"\"\"\n",
"    retList = [] # list of merged frequent itemsets\n", "    lenLk = len(freq_sets) # number of frequent itemsets\n", "    for i in range(lenLk):\n", "        for j in range(i+1, lenLk):\n", "            a = list(freq_sets[i])\n", "            b = list(freq_sets[j])\n", "            a.sort()\n", "            b.sort()\n", "            F1 = a[:k-2] # first k-2 items of freq_sets[i]\n", "            F2 = b[:k-2] # first k-2 items of freq_sets[j]\n", "\n", "            if F1 == F2: # if the first k-2 items are identical\n", "                # Merge the frequent itemsets.\n", "                retList.append(freq_sets[i] | freq_sets[j])\n", "\n", "    return retList\n", "\n",
"def rules_from_conseq(freq_set, H, support_data, rules, min_confidence=0.7):\n", "    \"\"\"Generates a set of candidate rules.\n", "\n", "    Parameters\n", "    ----------\n", "    freq_set : frozenset\n", "        The frequent itemset under consideration.\n", "\n", "    H : list\n", "        A list of candidate rule consequents (itemsets of a particular \n", "        length).\n", "\n", "    support_data : dict\n", "        The support data for all candidate itemsets.\n", "\n", "    rules : list\n", "        A potentially incomplete set of candidate rules above the minimum \n", "        confidence threshold.\n", "\n", "    min_confidence : float\n", "        The minimum confidence threshold. Defaults to 0.7.\n", "    \"\"\"\n", "    m = len(H[0])\n", "    if (len(freq_set) > (m+1)):\n", "        Hmp1 = apriori_gen(H, m+1) # generate candidate itemsets\n", "        Hmp1 = calc_confidence(freq_set, Hmp1, support_data, rules, min_confidence)\n", "        if len(Hmp1) > 1:\n", "            # If there are candidate rules above the minimum confidence \n", "            # threshold, recurse on the list of these candidate rules.\n", "            rules_from_conseq(freq_set, Hmp1, support_data, rules, min_confidence)\n", "\n",
"def calc_confidence(freq_set, H, support_data, rules, min_confidence=0.7, VERBOSE=False):\n", "    \"\"\"Evaluates the generated rules.\n", "\n", "    One measurement for quantifying the goodness of association rules is \n", "    confidence. The confidence for a rule 'P implies H' (P -> H) is defined \n", "    as the support for the combined itemset divided by the support for P, \n", "    i.e., support(P | H) / support(P), where the | symbol denotes the set \n", "    union (thus P | H means all the items in set P or in set H).\n", "\n", "    To calculate the confidence, we iterate through the frequent itemsets \n", "    and associated support data. For each frequent itemset, we divide the \n", "    support of the itemset by the support of the antecedent (left-hand side \n", "    of the rule).\n", "\n", "    Parameters\n", "    ----------\n", "    freq_set : frozenset\n", "        The frequent itemset under consideration.\n", "\n", "    H : list\n", "        A list of candidate rule consequents (itemsets).\n", "\n", "    support_data : dict\n", "        The support data for all candidate itemsets.\n", "\n", "    rules : list\n", "        A potentially incomplete set of candidate rules above the minimum \n", "        confidence threshold.\n", "\n", "    min_confidence : float\n", "        The minimum confidence threshold. Defaults to 0.7.\n", "\n",
"    Returns\n", "    -------\n", "    pruned_H : list\n", "        The list of candidate rules above the minimum confidence threshold.\n", "    \"\"\"\n", "    pruned_H = [] # list of candidate rules above the minimum confidence threshold\n", "    for conseq in H: # iterate over the candidate consequents\n", "        conf = support_data[freq_set] / support_data[freq_set - conseq]\n", "        if conf >= min_confidence:\n", "            rules.append((freq_set - conseq, conseq, conf))\n", "            pruned_H.append(conseq)\n", "\n", "            if VERBOSE:\n", "                print \"\" \\\n", "                    + \"{\" \\\n", "                    + \"\".join([str(i) + \", \" for i in iter(freq_set-conseq)]).rstrip(', ') \\\n", "                    + \"}\" \\\n", "                    + \" ---> \" \\\n", "                    + \"{\" \\\n", "                    + \"\".join([str(i) + \", \" for i in iter(conseq)]).rstrip(', ') \\\n", "                    + \"}\" \\\n", "                    + \": conf = \" + str(round(conf, 3)) \\\n", "                    + \", sup = \" + str(round(support_data[freq_set], 3))\n", "\n", "    return pruned_H\n", "\n",
"def generate_rules(F, support_data, min_confidence=0.7, VERBOSE=False):\n", "    \"\"\"Generates a set of candidate rules from a list of frequent itemsets.\n", "\n", "    For each frequent itemset, we calculate the confidence of using a \n", "    particular item as the rule consequent (right-hand side of the rule). By \n", "    testing and merging the remaining rules, we recursively create a list of \n", "    pruned rules.\n", "\n", "    Parameters\n", "    ----------\n", "    F : list\n", "        A list of frequent itemsets.\n", "\n", "    support_data : dict\n", "        The corresponding support data for the frequent itemsets (F).\n", "\n", "    min_confidence : float\n", "        The minimum confidence threshold. Defaults to 0.7.\n", "\n", "    Returns\n", "    -------\n", "    rules : list\n", "        The list of candidate rules above the minimum confidence threshold.\n", "    \"\"\"\n", "    rules = []\n", "    for i in range(1, len(F)):\n", "        for freq_set in F[i]:\n", "            H1 = [frozenset([item]) for item in freq_set]\n", "\n", "            if (i > 1):\n", "                rules_from_conseq(freq_set, H1, support_data, rules, min_confidence)\n", "            else:\n", "                calc_confidence(freq_set, H1, support_data, rules, min_confidence, VERBOSE=VERBOSE)\n", "\n", "    return rules" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 },
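{ "cell_type": "markdown", "metadata": {}, "source": [ "A brief aside on data structures: the functions above store support values in dictionaries keyed by itemset, and dictionary keys must be hashable, which is why candidate itemsets are mapped to frozensets rather than plain sets. A minimal sketch of the distinction (the itemset and support value below are purely illustrative):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Support values are stored in a dict keyed by itemset, so itemsets must be\n", "# hashable. A frozenset is immutable and hashable; a plain (mutable) set is not.\n", "support = {}\n", "support[frozenset(['Bread', 'Milk'])] = 0.6 # OK: frozenset is hashable\n", "#support[set(['Bread', 'Milk'])] = 0.6 # would raise TypeError: unhashable type: 'set'" ], "language": "python", "metadata": {}, "outputs": [] },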
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we load an example dataset of market basket transactions (a list of lists), map each transaction to a Python 'set' (which makes the subset tests used for support counting efficient), and print the transactions. We import and use pprint to format the output." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pprint\n", "\n", "def load_dataset():\n", "    \"\"\"Loads an example of market basket transactions for testing purposes.\n", "\n", "    Returns\n", "    -------\n", "    A list (database) of lists (transactions). Each element of a transaction \n", "    is an item.\n", "    \"\"\"\n", "    return [['Bread', 'Milk'], \n", "            ['Bread', 'Diapers', 'Beer', 'Eggs'], \n", "            ['Milk', 'Diapers', 'Beer', 'Coke'], \n", "            ['Bread', 'Milk', 'Diapers', 'Beer'], \n", "            ['Bread', 'Milk', 'Diapers', 'Coke']]\n", "\n", "dataset = load_dataset() # list of transactions; each transaction is a list of items\n", "D = map(set, dataset) # list of transactions; each transaction is a set of items\n", "\n", "pprint.pprint(dataset)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[['Bread', 'Milk'],\n", " ['Bread', 'Diapers', 'Beer', 'Eggs'],\n", " ['Milk', 'Diapers', 'Beer', 'Coke'],\n", " ['Bread', 'Milk', 'Diapers', 'Beer'],\n", " ['Bread', 'Milk', 'Diapers', 'Coke']]\n" ] } ], "prompt_number": 2 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There are several aspects to the Apriori algorithm. First, the algorithm makes an initial pass over the dataset to determine the support of each item (here, performed during the execution of the support_prune function on the candidate 1-itemsets returned by the create_candidates function). Upon completion of this step, the set of all frequent 1-itemsets will be known. Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration (here, via the apriori_gen function).\n", "\n", "Using our code, we can begin this process by creating the initial candidate 1-itemsets (the transactions themselves having already been mapped to Python sets above)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Generate candidate itemsets.\n", "C1 = create_candidates(dataset, VERBOSE=True) # candidate 1-itemsets" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{Beer, Bread, Coke, Diapers, Eggs, Milk}\n" ] } ], "prompt_number": 3 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can generate the frequent 1-itemsets by pruning candidate 1-itemsets that do not meet the minimum support. Here, we print the frequent 1-itemsets followed by the support values for all candidate 1-itemsets. Notice that the 1-itemsets with support values below the minimum support have been pruned." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Prune candidate 1-itemsets via support-based pruning to generate frequent 1-itemsets.\n", "F1, support_data = support_prune(D, C1, 0.6, VERBOSE=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{Milk}\n", "{Bread}\n", "{Beer}\n", "{Diapers}\n", "\n", "{Eggs}: sup = 0.2\n", "{Diapers}: sup = 0.8\n", "{Beer}: sup = 0.6\n", "{Bread}: sup = 0.8\n", "{Milk}: sup = 0.8\n", "{Coke}: sup = 0.4\n" ] } ], "prompt_number": 4 },
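{ "cell_type": "markdown", "metadata": {}, "source": [ "From here, the next step would be to merge the frequent 1-itemsets into candidate 2-itemsets via the apriori_gen function. The cell below is a small illustrative sketch of that single step (its output is not shown); note that for k = 2 the first k-2 items of any pair are trivially identical, so every pair of frequent 1-itemsets is merged:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Merge the frequent 1-itemsets into candidate 2-itemsets.\n", "C2 = apriori_gen(F1, 2)\n", "print C2" ], "language": "python", "metadata": {}, "outputs": [] },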
{ "cell_type": "markdown", "metadata": {}, "source": [ "Continuing in this fashion, we could iteratively generate and prune the remaining candidate itemsets. However, our code wraps the entire process into one function (apriori), which internally executes create_candidates, support_prune, and apriori_gen. We can simply input the initial dataset into this function (along with a minimum support threshold) and it will return a list of all the frequent itemsets:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Generate all the frequent itemsets using the Apriori algorithm.\n", "F, support_data = apriori(dataset, min_support=0.6, VERBOSE=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{Milk}: sup = 0.8\n", "{Bread}: sup = 0.8\n", "{Beer}: sup = 0.6\n", "{Diapers}: sup = 0.8\n", "{Beer, Diapers}: sup = 0.6\n", "{Diapers, Bread}: sup = 0.6\n", "{Milk, Diapers}: sup = 0.6\n", "{Milk, Bread}: sup = 0.6\n" ] } ], "prompt_number": 5 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Given the frequent itemsets extracted by the Apriori algorithm, we can generate the association rules that meet a minimum confidence threshold via the generate_rules function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Generate the association rules from a list of frequent itemsets.\n", "H = generate_rules(F, support_data, min_confidence=0.8, VERBOSE=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{Beer} ---> {Diapers}: conf = 1.0, sup = 0.6\n" ] } ], "prompt_number": 6 } ], "metadata": {} } ] }