{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Frequent Itemsets via Apriori Algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Apriori function to extract frequent itemsets for association rule mining"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"> from mlxtend.frequent_patterns import apriori"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Apriori [1] is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. The Apriori algorithm is designed to operate on databases containing transactions, such as the purchases made by customers of a store. An itemset is considered \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database."
]
},
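To make the support definition concrete, here is a minimal pure-Python sketch; the `support` helper and the `transactions` data are illustrative only and are not part of mlxtend:

```python
# Support of an itemset = fraction of transactions that contain all of its items
def support(itemset, transactions):
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

transactions = [['Milk', 'Eggs', 'Bread'],
                ['Milk', 'Bread'],
                ['Eggs', 'Bread'],
                ['Milk', 'Eggs']]

print(support(['Milk', 'Eggs'], transactions))  # 2 of 4 transactions -> 0.5
```

With a support threshold of 0.5, `['Milk', 'Eggs']` would just qualify as frequent under this definition.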
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## References\n",
"\n",
"[1] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"The `apriori` function expects data in a one-hot encoded pandas DataFrame.\n",
"Suppose we have the following transaction data:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n",
" ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n",
" ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],\n",
" ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],\n",
" ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can transform it into the right format via the `OnehotTransactions` encoder as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
" Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n",
"0 0 0 0 1 0 1 1 1 1 \n",
"1 0 0 1 1 0 1 0 1 1 \n",
"2 1 0 0 1 0 1 1 0 0 \n",
"3 0 1 0 0 0 1 1 0 0 \n",
"4 0 1 0 1 1 1 0 0 1 \n",
"\n",
" Unicorn Yogurt \n",
"0 0 1 \n",
"1 0 1 \n",
"2 0 0 \n",
"3 1 1 \n",
"4 0 0 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from mlxtend.preprocessing import OnehotTransactions\n",
"\n",
"oht = OnehotTransactions()\n",
"oht_ary = oht.fit(dataset).transform(dataset)\n",
"df = pd.DataFrame(oht_ary, columns=oht.columns_)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now, let us return the items and itemsets with at least 60% support:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
" support itemsets\n",
"0 0.8 [3]\n",
"1 1.0 [5]\n",
"2 0.6 [6]\n",
"3 0.6 [8]\n",
"4 0.6 [10]\n",
"5 0.8 [3, 5]\n",
"6 0.6 [3, 8]\n",
"7 0.6 [5, 6]\n",
"8 0.6 [5, 8]\n",
"9 0.6 [5, 10]\n",
"10 0.6 [3, 5, 8]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from mlxtend.frequent_patterns import apriori\n",
"\n",
"apriori(df, min_support=0.6)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"By default, `apriori` returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set `use_colnames=True` to convert these integer values into the respective item names: "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
" support itemsets\n",
"0 0.8 [Eggs]\n",
"1 1.0 [Kidney Beans]\n",
"2 0.6 [Milk]\n",
"3 0.6 [Onion]\n",
"4 0.6 [Yogurt]\n",
"5 0.8 [Eggs, Kidney Beans]\n",
"6 0.6 [Eggs, Onion]\n",
"7 0.6 [Kidney Beans, Milk]\n",
"8 0.6 [Kidney Beans, Onion]\n",
"9 0.6 [Kidney Beans, Yogurt]\n",
"10 0.6 [Eggs, Kidney Beans, Onion]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"apriori(df, min_support=0.6, use_colnames=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example 2"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The advantage of working with pandas `DataFrames` is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via `apriori` and add a new column that stores the length of each itemset:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
" support itemsets length\n",
"0 0.8 [Eggs] 1\n",
"1 1.0 [Kidney Beans] 1\n",
"2 0.6 [Milk] 1\n",
"3 0.6 [Onion] 1\n",
"4 0.6 [Yogurt] 1\n",
"5 0.8 [Eggs, Kidney Beans] 2\n",
"6 0.6 [Eggs, Onion] 2\n",
"7 0.6 [Kidney Beans, Milk] 2\n",
"8 0.6 [Kidney Beans, Onion] 2\n",
"9 0.6 [Kidney Beans, Yogurt] 2\n",
"10 0.6 [Eggs, Kidney Beans, Onion] 3"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)\n",
"frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))\n",
"frequent_itemsets"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Then, we can select the results that satisfy our desired criteria as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
" support itemsets length\n",
"5 0.8 [Eggs, Kidney Beans] 2"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frequent_itemsets[ (frequent_itemsets['length'] == 2) &\n",
" (frequent_itemsets['support'] >= 0.8) ]"
]
},
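The same boolean-mask approach works for filtering by item membership rather than by length or support. A minimal sketch, using a hand-built stand-in for the `frequent_itemsets` DataFrame shown above so the example is self-contained:

```python
import pandas as pd

# Hand-built stand-in mimicking the frequent_itemsets DataFrame above
frequent_itemsets = pd.DataFrame({
    'support': [0.8, 1.0, 0.6, 0.8, 0.6],
    'itemsets': [['Eggs'], ['Kidney Beans'], ['Onion'],
                 ['Eggs', 'Kidney Beans'], ['Eggs', 'Kidney Beans', 'Onion']],
})

# Keep only itemsets that contain 'Onion'
mask = frequent_itemsets['itemsets'].apply(lambda items: 'Onion' in items)
frequent_itemsets[mask]
```

Membership tests via `apply` work whether the `itemsets` column holds lists or sets, since `in` is supported by both.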
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## API"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## apriori\n",
"\n",
"*apriori(df, min_support=0.5, use_colnames=False)*\n",
"\n",
"Get frequent itemsets from a one-hot DataFrame\n",
"**Parameters**\n",
"\n",
"- `df` : pandas DataFrame\n",
"\n",
" pandas DataFrame in one-hot encoded format. For example\n",
" ```\n",
" Apple Bananas Beer Chicken Milk Rice\n",
" 0 1 0 1 1 0 1\n",
" 1 1 0 1 0 0 1\n",
" 2 1 0 1 0 0 0\n",
" 3 1 1 0 0 0 0\n",
" 4 0 0 1 1 1 1\n",
" 5 0 0 1 0 1 1\n",
" 6 0 0 1 0 1 0\n",
" 7 1 1 0 0 0 0\n",
" ```\n",
"\n",
"- `min_support` : float (default: 0.5)\n",
"\n",
" A float between 0 and 1 for minimum support of the itemsets returned.\n",
" The support is computed as the fraction\n",
" transactions_where_item(s)_occur / total_transactions.\n",
"\n",
"- `use_colnames` : bool (default: False)\n",
"\n",
" If true, uses the DataFrames' column names in the returned DataFrame\n",
" instead of column indices.\n",
"**Returns**\n",
"\n",
"pandas DataFrame with columns ['support', 'itemsets'] of all itemsets\n",
" that are >= min_support.\n",
"\n",
"\n"
]
}
],
"source": [
"with open('../../api_modules/mlxtend.frequent_patterns/apriori.md', 'r') as f:\n",
" print(f.read())"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}