{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import rulematrix\n", "from rulematrix.surrogate import rule_surrogate\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import load_breast_cancer, load_iris" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Dataset\n", "\n", "First, we load a dataset.\n", "To make use of the visualization, it's better to provide feature names and target names.\n", "\n", "We partition the dataset into training and test set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load dataset\n", "# dataset = load_iris()\n", "dataset = load_breast_cancer()\n", "\n", "# Feature Information\n", "is_continuous = dataset.get('is_continuous', None)\n", "is_categorical = dataset.get('is_categorical', None)\n", "is_integer = dataset.get('is_integer', None)\n", "feature_names = dataset.get('feature_names', None)\n", "target_names = dataset.get('target_names', None)\n", "\n", "# Split dataset into train and test\n", "train_x, test_x, train_y, test_y = \\\n", " train_test_split(dataset['data'], dataset['target'], test_size=0.25, random_state=42)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a Neural Net" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training score: 0.9061032863849765\n", "Test score: 0.9230769230769231\n" ] } ], "source": [ "def train_nn(neurons=(20,), **kwargs):\n", " is_categorical = dataset.get('is_categorical', None)\n", " model = MLPClassifier(hidden_layer_sizes=neurons, **kwargs)\n", " if is_categorical is not None:\n", " model = Pipeline([\n", " ('one_hot', 
OneHotEncoder(categorical_features=is_categorical)),\n", " ('mlp', model)\n", " ])\n", " model.fit(train_x, train_y)\n", " train_score = model.score(train_x, train_y)\n", " test_score = model.score(test_x, test_y)\n", " print('Training score:', train_score)\n", " print('Test score:', test_score)\n", " return model\n", "\n", "\n", "nn = train_nn((20, 20, 20), random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Rule Surrogate\n", "\n", "Next, we train a surrogate rule list that approximates the neural net, and then render the RuleMatrix visualization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training fidelity: 0.8779342723004695\n", "Test fidelity: 0.8951048951048951\n", "The rule list contains 10 of rules:\n", "\n", " IF (worst area in (-inf, 91.3)) THEN prob: [0.9947, 0.0053]\n", "\n", "ELSE IF (area error in (74.37, inf)) THEN prob: [0.9973, 0.0027]\n", "\n", "ELSE IF (mean perimeter in (-inf, 52.41)) THEN prob: [0.6957, 0.3043]\n", "\n", "ELSE IF (worst area in (1134.9, inf)) THEN prob: [0.9862, 0.0138]\n", "\n", "ELSE IF (area error in (48.8, 74.37)) AND (mean compactness in (0.14295, inf)) THEN prob: [0.8868, 0.1132]\n", "\n", "ELSE IF (mean perimeter in (108.31, 120.6)) THEN prob: [0.2381, 0.7619]\n", "\n", "ELSE IF (worst area in (734.0, 958.2)) AND (mean area in (611.8, 853.1)) THEN prob: [0.5364, 0.4636]\n", "\n", "ELSE IF (worst concavity in (0.3725, inf)) AND (mean area in (63.9, 294.4)) THEN prob: [0.7600, 0.2400]\n", "\n", "ELSE IF (worst area in (126.5, 734.0)) THEN prob: [0.3571, 0.6429]\n", "\n", "ELSE DEFAULT prob: [0.9333, 0.0667]\n", "\n" ] } ], "source": [ "def train_surrogate(model, sampling_rate=2.0, **kwargs):\n", " surrogate = rule_surrogate(model.predict, train_x, sampling_rate=sampling_rate,\n", " is_continuous=is_continuous,\n", " is_categorical=is_categorical,\n", " is_integer=is_integer,\n", " 
rlargs={'feature_names': feature_names, 'verbose': 2},\n", " **kwargs)\n", "\n", " train_fidelity = surrogate.score(train_x)\n", " test_fidelity = surrogate.score(test_x)\n", " print('Training fidelity:', train_fidelity)\n", " print('Test fidelity:', test_fidelity)\n", " return surrogate\n", "\n", "surrogate = train_surrogate(nn, 4, seed=44)\n", "rl = surrogate.student\n", "print(rl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Render RuleMatrix\n", "\n", "Now let's render the RuleMatrix visualization.\n", "\n", "Here are some instructions on how to read the RuleMatrix.\n", "\n", "1. In the middle is a matrix of rules. Each row represents a rule, and each column represents a feature. \n", " For example, rule 2 reads: IF area error > 74.4 THEN Prob(malignant) = 1.0. \n", " \n", "2. The shaded part of a cell indicates the value range of the feature used in the rule. \n", " The light blue histogram in the cell shows the distribution of the feature.\n", " You can hover over a cell to read the text of the rule. \n", " You can also click a cell to expand it and inspect the detailed feature distribution.\n", "\n", "3. To the left of the matrix is the data flow, showing how the data is captured by each of the rules. The width of the flow indicates the number of data points captured/uncaptured by each rule.\n", " The color of the flow indicates the label. For example, the ratio of malignant to benign data in the breast_cancer dataset is about 1:2. \n", " \n", "4. To the right of the matrix is detailed information about each rule. Fidelity measures how accurately a rule represents/approximates the original model (on the data captured by that rule). The evidence (or support, in itemset-mining terms) shows the number of data points with each label. For example, the data captured by rule 9 is mostly benign. The striped part encodes the portion of the data wrongly classified by the model as the label represented by the color."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rulematrix.render(train_x, train_y, surrogate, \n", " feature_names=feature_names, target_names=target_names, \n", " is_categorical=is_categorical)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }