{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Association Rules Generation from Frequent Itemsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function to generate association rules from frequent itemsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.frequent_patterns import association_rules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rule generation is a common task in the mining of frequent patterns. _An association rule is an implication expression of the form $X \\rightarrow Y$, where $X$ and $Y$ are disjoint itemsets_ [1]. A more concrete example based on consumer behaviour would be $\\{Diapers\\} \\rightarrow \\{Beer\\}$, suggesting that people who buy diapers are also likely to buy beer. To evaluate the \"interest\" of such an association rule, different metrics have been developed. The current implementation makes use of the `confidence` and `lift` metrics. Confidence is the fraction of transactions containing $X$ that also contain $Y$, i.e., $\\text{support}(X \\cup Y) / \\text{support}(X)$. Lift is the confidence divided by $\\text{support}(Y)$, so a lift of 1 indicates that $X$ and $Y$ are statistically independent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1] Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The `association_rules` function takes DataFrames of frequent itemsets as produced by the `apriori` function in *mlxtend.frequent_patterns*. To demonstrate the usage of `association_rules`, we first create a pandas `DataFrame` of frequent itemsets as generated by the [`apriori`](./apriori.md) function:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>support</th><th>itemsets</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.8</td><td>[Eggs]</td></tr>\n",
"    <tr><th>1</th><td>1.0</td><td>[Kidney Beans]</td></tr>\n",
"    <tr><th>2</th><td>0.6</td><td>[Milk]</td></tr>\n",
"    <tr><th>3</th><td>0.6</td><td>[Onion]</td></tr>\n",
"    <tr><th>4</th><td>0.6</td><td>[Yogurt]</td></tr>\n",
"    <tr><th>5</th><td>0.8</td><td>[Eggs, Kidney Beans]</td></tr>\n",
"    <tr><th>6</th><td>0.6</td><td>[Eggs, Onion]</td></tr>\n",
"    <tr><th>7</th><td>0.6</td><td>[Kidney Beans, Milk]</td></tr>\n",
"    <tr><th>8</th><td>0.6</td><td>[Kidney Beans, Onion]</td></tr>\n",
"    <tr><th>9</th><td>0.6</td><td>[Kidney Beans, Yogurt]</td></tr>\n",
"    <tr><th>10</th><td>0.6</td><td>[Eggs, Kidney Beans, Onion]</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " support itemsets\n", "0 0.8 [Eggs]\n", "1 1.0 [Kidney Beans]\n", "2 0.6 [Milk]\n", "3 0.6 [Onion]\n", "4 0.6 [Yogurt]\n", "5 0.8 [Eggs, Kidney Beans]\n", "6 0.6 [Eggs, Onion]\n", "7 0.6 [Kidney Beans, Milk]\n", "8 0.6 [Kidney Beans, Onion]\n", "9 0.6 [Kidney Beans, Yogurt]\n", "10 0.6 [Eggs, Kidney Beans, Onion]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from mlxtend.preprocessing import OnehotTransactions\n", "from mlxtend.frequent_patterns import apriori\n", "\n", "\n", "# Example transactions: each inner list is one shopping basket\n", "dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", "           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", "           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],\n", "           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],\n", "           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]\n", "\n", "# One-hot encode the baskets into a DataFrame with one column per item\n", "oht = OnehotTransactions()\n", "oht_ary = oht.fit(dataset).transform(dataset)\n", "df = pd.DataFrame(oht_ary, columns=oht.columns_)\n", "\n", "# Keep itemsets that occur in at least 60% of the transactions\n", "frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)\n", "\n", "frequent_itemsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `association_rules()` function allows you to specify (1) your metric of interest and (2) the corresponding threshold. Currently implemented measures are **confidence** and **lift**. Let's say you are interested in rules derived from the frequent itemsets only if the level of confidence is at least 90 percent (`min_threshold=0.9`):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>antecedants</th><th>consequents</th><th>support</th><th>confidence</th><th>lift</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>(Eggs)</td><td>(Kidney Beans)</td><td>0.8</td><td>1.0</td><td>1.00</td></tr>\n",
"    <tr><th>1</th><td>(Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.0</td><td>1.25</td></tr>\n",
"    <tr><th>2</th><td>(Milk)</td><td>(Kidney Beans)</td><td>0.6</td><td>1.0</td><td>1.00</td></tr>\n",
"    <tr><th>3</th><td>(Onion)</td><td>(Kidney Beans)</td><td>0.6</td><td>1.0</td><td>1.00</td></tr>\n",
"    <tr><th>4</th><td>(Yogurt)</td><td>(Kidney Beans)</td><td>0.6</td><td>1.0</td><td>1.00</td></tr>\n",
"    <tr><th>5</th><td>(Eggs, Onion)</td><td>(Kidney Beans)</td><td>0.6</td><td>1.0</td><td>1.00</td></tr>\n",
"    <tr><th>6</th><td>(Kidney Beans, Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.0</td><td>1.25</td></tr>\n",
"    <tr><th>7</th><td>(Onion)</td><td>(Eggs, Kidney Beans)</td><td>0.6</td><td>1.0</td><td>1.25</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " antecedants consequents support confidence lift\n", "0 (Eggs) (Kidney Beans) 0.8 1.0 1.00\n", "1 (Onion) (Eggs) 0.6 1.0 1.25\n", "2 (Milk) (Kidney Beans) 0.6 1.0 1.00\n", "3 (Onion) (Kidney Beans) 0.6 1.0 1.00\n", "4 (Yogurt) (Kidney Beans) 0.6 1.0 1.00\n", "5 (Eggs, Onion) (Kidney Beans) 0.6 1.0 1.00\n", "6 (Kidney Beans, Onion) (Eggs) 0.6 1.0 1.25\n", "7 (Onion) (Eggs, Kidney Beans) 0.6 1.0 1.25" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.frequent_patterns import association_rules\n", "\n", "association_rules(frequent_itemsets, metric=\"confidence\", min_threshold=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are interested in rules fulfilling a different interest metric, you can simply adjust the parameters. E.g. if you are interested only in rules that have a lift score of >= 1.2, you would do the following:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>antecedants</th><th>consequents</th><th>support</th><th>confidence</th><th>lift</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>(Eggs)</td><td>(Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td></tr>\n",
"    <tr><th>1</th><td>(Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.00</td><td>1.25</td></tr>\n",
"    <tr><th>2</th><td>(Eggs, Kidney Beans)</td><td>(Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td></tr>\n",
"    <tr><th>3</th><td>(Kidney Beans, Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.00</td><td>1.25</td></tr>\n",
"    <tr><th>4</th><td>(Eggs)</td><td>(Kidney Beans, Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td></tr>\n",
"    <tr><th>5</th><td>(Onion)</td><td>(Eggs, Kidney Beans)</td><td>0.6</td><td>1.00</td><td>1.25</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " antecedants consequents support confidence lift\n", "0 (Eggs) (Onion) 0.8 0.75 1.25\n", "1 (Onion) (Eggs) 0.6 1.00 1.25\n", "2 (Eggs, Kidney Beans) (Onion) 0.8 0.75 1.25\n", "3 (Kidney Beans, Onion) (Eggs) 0.6 1.00 1.25\n", "4 (Eggs) (Kidney Beans, Onion) 0.8 0.75 1.25\n", "5 (Onion) (Eggs, Kidney Beans) 0.6 1.00 1.25" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rules = association_rules(frequent_itemsets, metric=\"lift\", min_threshold=1.2)\n", "rules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas `DataFrame`s make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:\n", "\n", "1. at least 2 antecedents\n", "2. a confidence > 0.75\n", "3. a lift score > 1.2\n", "\n", "We can compute the antecedent length as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>antecedants</th><th>consequents</th><th>support</th><th>confidence</th><th>lift</th><th>antecedant_len</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>(Eggs)</td><td>(Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>(Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.00</td><td>1.25</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>(Eggs, Kidney Beans)</td><td>(Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td><td>2</td></tr>\n",
"    <tr><th>3</th><td>(Kidney Beans, Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.00</td><td>1.25</td><td>2</td></tr>\n",
"    <tr><th>4</th><td>(Eggs)</td><td>(Kidney Beans, Onion)</td><td>0.8</td><td>0.75</td><td>1.25</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>(Onion)</td><td>(Eggs, Kidney Beans)</td><td>0.6</td><td>1.00</td><td>1.25</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " antecedants consequents support confidence lift \\\n", "0 (Eggs) (Onion) 0.8 0.75 1.25 \n", "1 (Onion) (Eggs) 0.6 1.00 1.25 \n", "2 (Eggs, Kidney Beans) (Onion) 0.8 0.75 1.25 \n", "3 (Kidney Beans, Onion) (Eggs) 0.6 1.00 1.25 \n", "4 (Eggs) (Kidney Beans, Onion) 0.8 0.75 1.25 \n", "5 (Onion) (Eggs, Kidney Beans) 0.6 1.00 1.25 \n", "\n", " antecedant_len \n", "0 1 \n", "1 1 \n", "2 2 \n", "3 2 \n", "4 1 \n", "5 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rules[\"antecedant_len\"] = rules[\"antecedants\"].apply(lambda x: len(x))\n", "rules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can use pandas' selection syntax as shown below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>antecedants</th><th>consequents</th><th>support</th><th>confidence</th><th>lift</th><th>antecedant_len</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3</th><td>(Kidney Beans, Onion)</td><td>(Eggs)</td><td>0.6</td><td>1.0</td><td>1.25</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " antecedants consequents support confidence lift \\\n", "3 (Kidney Beans, Onion) (Eggs) 0.6 1.0 1.25 \n", "\n", " antecedant_len \n", "3 2 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rules[ (rules['antecedant_len'] >= 2) &\n", " (rules['confidence'] > 0.75) &\n", " (rules['lift'] > 1.2) ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## association_rules\n", "\n", "*association_rules(df, metric='confidence', min_threshold=0.8)*\n", "\n", "Generates a DataFrame of association rules including the\n", "metrics 'score', 'confidence', and 'lift'\n", "\n", "**Parameters**\n", "\n", "- `df` : pandas DataFrame\n", "\n", " pandas DataFrame of frequent itemsets\n", " with columns ['support', 'itemsets']\n", "\n", "- `metric` : string (default: 'confidence')\n", "\n", " Metric to evaluate if a rule is of interest.\n", " Supported metrics are 'confidence' and 'lift'\n", "\n", "- `min_threshold` : float (default: 0.8)\n", "\n", " Minimal threshold for the evaluation metric\n", " to decide whether a candidate rule is of interest.\n", "\n", "**Returns**\n", "\n", "pandas DataFrame with columns ['antecedants', 'consequents',\n", " 'support', 'lift', 'confidence'] of all rules for which\n", " metric(rule) >= min_threshold.\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.frequent_patterns/association_rules.md', 'r') as f:\n", " print(f.read())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }