{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Frequent Itemsets via Apriori Algorithm" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Apriori function to extract frequent itemsets for association rule mining" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "> from mlxtend.frequent_patterns import apriori" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Apriori is a popular algorithm [1] for extracting frequent itemsets with applications in association rule learning. The apriori algorithm is designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## References\n", "\n", "[1] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994." 
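] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "To make the support definition above concrete, here is a minimal sketch (not the `mlxtend` implementation) of how the support of a candidate itemset could be computed, assuming `transactions` is a list of item lists:\n", "\n", "```python\n", "def support(itemset, transactions):\n", "    # Fraction of transactions that contain every item in `itemset`\n", "    hits = sum(1 for t in transactions if set(itemset) <= set(t))\n", "    return hits / len(transactions)\n", "```\n", "\n", "Apriori avoids evaluating this for every possible itemset by exploiting the fact that a superset can never have higher support than any of its subsets."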
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Example 1" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "The `apriori` function expects data in a one-hot encoded pandas DataFrame.\n", "Suppose we have the following transaction data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", " ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", " ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],\n", " ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],\n", " ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can transform it into the right format via the `OnehotTransactions` encoder as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AppleCornDillEggsIce creamKidney BeansMilkNutmegOnionUnicornYogurt
000010111101
100110101101
210010110000
301000110011
401011100100
\n", "
" ], "text/plain": [ " Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n", "0 0 0 0 1 0 1 1 1 1 \n", "1 0 0 1 1 0 1 0 1 1 \n", "2 1 0 0 1 0 1 1 0 0 \n", "3 0 1 0 0 0 1 1 0 0 \n", "4 0 1 0 1 1 1 0 0 1 \n", "\n", " Unicorn Yogurt \n", "0 0 1 \n", "1 0 1 \n", "2 0 0 \n", "3 1 1 \n", "4 0 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from mlxtend.preprocessing import OnehotTransactions\n", "\n", "oht = OnehotTransactions()\n", "oht_ary = oht.fit(dataset).transform(dataset)\n", "df = pd.DataFrame(oht_ary, columns=oht.columns_)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, let us return the items and itemsets with at least 60% support:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsets
00.8[3]
11.0[5]
20.6[6]
30.6[8]
40.6[10]
50.8[3, 5]
60.6[3, 8]
70.6[5, 6]
80.6[5, 8]
90.6[5, 10]
100.6[3, 5, 8]
\n", "
" ], "text/plain": [ " support itemsets\n", "0 0.8 [3]\n", "1 1.0 [5]\n", "2 0.6 [6]\n", "3 0.6 [8]\n", "4 0.6 [10]\n", "5 0.8 [3, 5]\n", "6 0.6 [3, 8]\n", "7 0.6 [5, 6]\n", "8 0.6 [5, 8]\n", "9 0.6 [5, 10]\n", "10 0.6 [3, 5, 8]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.frequent_patterns import apriori\n", "\n", "apriori(df, min_support=0.6)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "By default, `apriori` returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set `use_colnames=True` to convert these integer values into the respective item names: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsets
00.8[Eggs]
11.0[Kidney Beans]
20.6[Milk]
30.6[Onion]
40.6[Yogurt]
50.8[Eggs, Kidney Beans]
60.6[Eggs, Onion]
70.6[Kidney Beans, Milk]
80.6[Kidney Beans, Onion]
90.6[Kidney Beans, Yogurt]
100.6[Eggs, Kidney Beans, Onion]
\n", "
" ], "text/plain": [ " support itemsets\n", "0 0.8 [Eggs]\n", "1 1.0 [Kidney Beans]\n", "2 0.6 [Milk]\n", "3 0.6 [Onion]\n", "4 0.6 [Yogurt]\n", "5 0.8 [Eggs, Kidney Beans]\n", "6 0.6 [Eggs, Onion]\n", "7 0.6 [Kidney Beans, Milk]\n", "8 0.6 [Kidney Beans, Onion]\n", "9 0.6 [Kidney Beans, Yogurt]\n", "10 0.6 [Eggs, Kidney Beans, Onion]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "apriori(df, min_support=0.6, use_colnames=True)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Example 2" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The advantage of working with pandas `DataFrames` is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via `apriori` and add a new column that stores the length of each itemset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsetslength
00.8[Eggs]1
11.0[Kidney Beans]1
20.6[Milk]1
30.6[Onion]1
40.6[Yogurt]1
50.8[Eggs, Kidney Beans]2
60.6[Eggs, Onion]2
70.6[Kidney Beans, Milk]2
80.6[Kidney Beans, Onion]2
90.6[Kidney Beans, Yogurt]2
100.6[Eggs, Kidney Beans, Onion]3
\n", "
" ], "text/plain": [ " support itemsets length\n", "0 0.8 [Eggs] 1\n", "1 1.0 [Kidney Beans] 1\n", "2 0.6 [Milk] 1\n", "3 0.6 [Onion] 1\n", "4 0.6 [Yogurt] 1\n", "5 0.8 [Eggs, Kidney Beans] 2\n", "6 0.6 [Eggs, Onion] 2\n", "7 0.6 [Kidney Beans, Milk] 2\n", "8 0.6 [Kidney Beans, Onion] 2\n", "9 0.6 [Kidney Beans, Yogurt] 2\n", "10 0.6 [Eggs, Kidney Beans, Onion] 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)\n", "frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))\n", "frequent_itemsets" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Then, we can select the results that satisfy our desired criteria as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsetslength
50.8[Eggs, Kidney Beans]2
\n", "
" ], "text/plain": [ " support itemsets length\n", "5 0.8 [Eggs, Kidney Beans] 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frequent_itemsets[ (frequent_itemsets['length'] == 2) &\n", " (frequent_itemsets['support'] >= 0.8) ]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## apriori\n", "\n", "*apriori(df, min_support=0.5, use_colnames=False)*\n", "\n", "Get frequent itemsets from a one-hot DataFrame\n", "**Parameters**\n", "\n", "- `df` : pandas DataFrame\n", "\n", " pandas DataFrame in one-hot encoded format. For example\n", " ```\n", " Apple Bananas Beer Chicken Milk Rice\n", " 0 1 0 1 1 0 1\n", " 1 1 0 1 0 0 1\n", " 2 1 0 1 0 0 0\n", " 3 1 1 0 0 0 0\n", " 4 0 0 1 1 1 1\n", " 5 0 0 1 0 1 1\n", " 6 0 0 1 0 1 0\n", " 7 1 1 0 0 0 0\n", " ```\n", "\n", "- `min_support` : float (default: 0.5)\n", "\n", " A float between 0 and 1 for minimum support of the itemsets returned.\n", " The support is computed as the fraction\n", " transactions_where_item(s)_occur / total_transactions.\n", "\n", "- `use_colnames` : bool (default: False)\n", "\n", " If true, uses the DataFrame's column names in the returned DataFrame\n", " instead of column indices.\n", "**Returns**\n", "\n", "pandas DataFrame with columns ['support', 'itemsets'] of all itemsets\n", " that are >= min_support.\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.frequent_patterns/apriori.md', 'r') as f:\n", " print(f.read())" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": 
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 0 }