{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section6_7-Machine-Learning-Visualization.ipynb)\n", "\n", "# Machine Learning Visualization Tools\n", "\n", "While `scikit-learn` includes a rich selection of model diagnostic and selection tools, model evaluation is often aided by the generation of visualizations, particularly when there are a large number of features involved. This short tutorial introduces the [YellowBrick](http://www.scikit-yb.org) package, which extends the scikit-learn API with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to generate figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models and assist in diagnosing problems throughout the machine learning workflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!conda install -y yellowbrick" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import warnings\n", "import numpy as np\n", "\n", "warnings.simplefilter('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anscombe's quartet illustrates why visualization is important for model evaluation!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import yellowbrick as yb\n", "import matplotlib.pyplot as plt\n", "\n", "g = yb.anscombe()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizers\n", "\n", "Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high dimensional data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Rank1D` and `Rank2D` evaluate single features or pairs of features using a variety of metrics that score the features along the range [-1, 1] or [0, 1], thereby allowing them to be ranked. 
Similar in concept to scatterplot matrices (SPLOM), the scores are displayed on a lower triangular heatmap so that patterns between pairs of features can be easily discerned for downstream analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data set\n", "DATA_URL = 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/'\n", "try:\n", " data = pd.read_csv('../data/credit.csv')\n", "except FileNotFoundError:\n", " data = pd.read_csv(DATA_URL + 'credit.csv')\n", "\n", "# Specify the features of interest\n", "features = [\n", " 'limit', 'sex', 'edu', 'married', 'age', 'apr_delay', 'may_delay',\n", " 'jun_delay', 'jul_delay', 'aug_delay', 'sep_delay', 'apr_bill', 'may_bill',\n", " 'jun_bill', 'jul_bill', 'aug_bill', 'sep_bill', 'apr_pay', 'may_pay', 'jun_pay',\n", " 'jul_pay', 'aug_pay', 'sep_pay',\n", " ]\n", "\n", "# Extract the numpy arrays from the data frame\n", "X = data[features]\n", "y = data.default" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.features.rankd import Rank1D\n", "\n", "# Instantiate the 1D visualizer with the Shapiro ranking algorithm\n", "visualizer = Rank1D(features=features, algorithm='shapiro')\n", "\n", "visualizer.fit(X, y) # Fit the data to the visualizer\n", "visualizer.transform(X) # Transform the data\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A two dimensional ranking of features in `Rank2D` applies a ranking algorithm that relates features pair-wise (covariance, by default)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.features.rankd import Rank2D\n", "\n", "# Instantiate the visualizer with the Covariance ranking algorithm\n", "visualizer = Rank2D(features=features, algorithm='covariance')\n", "\n", "visualizer.fit(X, y) # Fit the data to the visualizer\n", "visualizer.transform(X) # Transform the data\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instantiate the visualizer with the Pearson ranking algorithm\n", "visualizer = Rank2D(features=features, algorithm='pearson')\n", "\n", "visualizer.fit(X, y) # Fit the data to the visualizer\n", "visualizer.transform(X) # Transform the data\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A joint plot visualizer plots a feature against the target and shows the distribution of each via a histogram on each axis."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "try:\n", " df = pd.read_csv('../data/concrete.csv')\n", "except FileNotFoundError:\n", " df = pd.read_csv(DATA_URL + 'concrete.csv')\n", " \n", "feature = 'cement'\n", "target = 'strength'\n", "\n", "# Get the X and y data from the DataFrame\n", "X = df[feature]\n", "y = df[target]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.features import JointPlotVisualizer\n", "\n", "visualizer = JointPlotVisualizer(feature=feature, target=target)\n", "\n", "visualizer.fit(X, y)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where the density of points is large, hexbins can be used to plot the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visualizer = JointPlotVisualizer(\n", " feature=feature, target=target, joint_plot='hex'\n", ")\n", "\n", "visualizer.fit(X, y)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "from yellowbrick.regressor import PredictionError\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Create training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[[feature]], y, test_size=400\n", ")\n", "\n", "visualizer = PredictionError(Ridge(alpha=3.612))\n", "visualizer.fit(X_train, y_train)\n", "visualizer.score(X_test, y_test)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization of test-set errors\n", "\n", "Using YellowBrick we can show the residuals (difference between the predicted value and the truth) both for the training set and the testing set (respectively blue and green).\n", "\n", "If the training and testing residuals had a different distribution it might indicate various problems:\n", "\n", "- small training residuals and large testing residuals might indicate over-fitting\n", "- differing distributions of values along the x axis might suggest that the training and testing values don't represent similar samples\n", "- the size of the residuals at certain points in the range might be very-wrong suggesting poor model convergence in these areas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']\n", "target_name = 'strength'\n", "\n", "# Get the X and y data from the DataFrame\n", "X = df[feature_names]\n", "y = df[target_name]\n", "\n", "# Create the train and test data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.regressor import ResidualsPlot\n", "\n", "# Instantiate the linear model and visualizer\n", "ridge = Ridge()\n", "visualizer = ResidualsPlot(ridge)\n", "\n", "visualizer.fit(X_train, y_train) # Fit the training data to the visualizer\n", "visualizer.score(X_test, y_test) # Evaluate the model on the test data\n", "g = visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import 
LinearRegression\n", "\n", "try:\n", " data = pd.read_csv('../data/bikeshare.csv')\n", "except FileNotFoundError:\n", " data = pd.read_csv(DATA_URL + 'bikeshare.csv')\n", " \n", "X = data[[\n", " \"season\", \"month\", \"hour\", \"holiday\", \"weekday\", \"workingday\",\n", " \"weather\", \"temp\", \"feelslike\", \"humidity\", \"windspeed\"\n", "]]\n", "y = data[\"riders\"]\n", "\n", "# Create training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=400\n", ")\n", "\n", "visualizer = ResidualsPlot(LinearRegression())\n", "visualizer.fit(X_train, y_train)\n", "visualizer.score(X_test, y_test)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning regularization\n", "\n", "Regularization is designed to penalize model complexity; therefore, the higher the alpha, the less complex the model, decreasing the error due to variance (overfitting). Alphas that are too high, on the other hand, increase the error due to bias (underfitting). It is important, therefore, to choose an optimal alpha such that the error is minimized in both directions.\n", "\n", "The `AlphaSelection` Visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import RidgeCV\n", "from yellowbrick.regressor import AlphaSelection\n", "\n", "alphas = np.logspace(-10, 1, 200)\n", "visualizer = AlphaSelection(RidgeCV(alphas=alphas))\n", "visualizer.fit(X, y)\n", "visualizer.poof()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LassoCV\n", "\n", "feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']\n", "target_name = 'strength'\n", "\n", "# Get the X and y data from the DataFrame\n", "X = df[feature_names]\n", "y = df[target_name]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# Create a list of alphas to cross-validate against\n", "alphas = np.logspace(-12, -0.5, 400)\n", "\n", "# Instantiate the linear model and visualizer\n", "model = LassoCV(alphas=alphas)\n", "visualizer = AlphaSelection(model)\n", "\n", "visualizer.fit(X_train, y_train) # Fit the training data to the visualizer\n", "g = visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Classifying poisonous mushrooms\n", "\n", "We will use a new sample dataset to demonstrate a viable model development pipeline. The data include descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species was identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this latter class was combined with the poisonous one). The objective is to build a model for classifying the mushrooms based on their physical characteristics."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "\n", "names = [\n", " 'class',\n", " 'cap-shape',\n", " 'cap-surface',\n", " 'cap-color'\n", "]\n", "\n", "dataset = pd.read_csv('../data/mushroom.csv')\n", "dataset.columns = names\n", "dataset.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = ['cap-shape', 'cap-surface', 'cap-color']\n", "target = ['class']\n", "\n", "X = dataset[features]\n", "y = dataset[target]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To facilitate the development of custom machine learning classes, scikit-learn provides templates in the form of `Mixin` classes that allows for easy subclassing of the major classes, so that new classes conform to the scikit-learn API. For example, we can create a custom encoder by inheriting from `BaseEstimator` (the parent class of all scikit-learn estimators) and the `TransformerMixin`, which provides hooks into the `fit` and `transform` methods. All that is required from the user is the specification of these methods to implement the custom transformation of interest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "\n", "\n", "class EncodeCategorical(BaseEstimator, TransformerMixin):\n", " \"\"\"\n", " Encodes a specified list of columns or all columns if None.\n", " \"\"\"\n", "\n", " def __init__(self, columns=None):\n", " self.columns = [col for col in columns]\n", " self.encoders = None\n", "\n", " def fit(self, data, target=None):\n", " \"\"\"\n", " Expects a data frame with named columns to encode.\n", " \"\"\"\n", " # Encode all columns if columns is None\n", " if self.columns is None:\n", " self.columns = data.columns\n", "\n", " # Fit a label encoder for each column in the data frame\n", " self.encoders = {\n", " column: LabelEncoder().fit(data[column])\n", " for column in self.columns\n", " }\n", " return self\n", "\n", " def transform(self, data):\n", " \"\"\"\n", " Uses the encoders to transform a data frame.\n", " \"\"\"\n", " output = data.copy()\n", " for column, encoder in self.encoders.items():\n", " output[column] = encoder.transform(data[column])\n", "\n", " return output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our data is categorical, so we will need to encode these variables as numeric values for machine learning. As we know, sckit-learn provides a `LabelEncoder` transformer for converting categorical labels into numeric integers, but it can only transform a single vector at a time, so we’ll have to adapt it in order to apply it to multiple columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By inheriting from the scikit-learn classes, our new encoder class can be used in the creation of a model selection pipeline. This process will include:\n", "\n", "- encoding categorical variables in the dataset\n", "- performing one-hot encoding on the categorical variables\n", "- applying the resulting encoded dataset to a particular estimator\n", "- returning a relevant metric for performing model selection\n", "\n", "Since we are doing binary classification, we will use the Brier score for model evaluation." 
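] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a reminder, the Brier score is the mean squared difference between the predicted probability $p_i$ and the observed binary outcome $o_i$:\n", "\n", "$$BS = \\frac{1}{N} \\sum_{i=1}^{N} (p_i - o_i)^2$$\n", "\n", "Lower scores are better. Note that the pipeline below passes hard 0/1 class predictions, rather than probabilities, to `brier_score_loss`, so in this case the score reduces to the misclassification rate."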
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import brier_score_loss\n", "from sklearn.pipeline import Pipeline\n", "\n", "\n", "def model_selection(X, y, estimator, metric=brier_score_loss):\n", " \"\"\"\n", " Test various estimators.\n", " \"\"\"\n", " y = LabelEncoder().fit_transform(y.values.ravel())\n", " model = Pipeline([\n", " ('label_encoding', EncodeCategorical(X.keys())),\n", " ('one_hot_encoder', OneHotEncoder()),\n", " ('estimator', estimator)\n", " ])\n", "\n", " # Instantiate the classification model and visualizer\n", " model.fit(X, y)\n", "\n", " expected = y\n", " predicted = model.predict(X)\n", "\n", " # Compute and return the F1 score (the harmonic mean of precision and recall)\n", " return (metric(expected, predicted))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import LinearSVC, NuSVC, SVC\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier\n", "from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, LinearSVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, SVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, NuSVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, SGDClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, KNeighborsClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, LogisticRegressionCV())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, LogisticRegression())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, BaggingClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, ExtraTreesClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_selection(X, y, RandomForestClassifier())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s refactor our model evaluation function to use Yellowbrick’s `ClassificationReport` class, a visualizer that displays the precision, recall, and F1 scores. \n", "\n", "> The F1 score is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0.\n", "\n", "This visual model analysis tool integrates numerical scores as well as color-coded heatmaps in order to support easy interpretation and detection, particularly the nuances of Type I and Type II error." 
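] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the scores displayed by the report are defined in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$):\n", "\n", "$$\\text{precision} = \\frac{TP}{TP + FP}, \\qquad \\text{recall} = \\frac{TP}{TP + FN}, \\qquad F_1 = \\frac{2 \\cdot \\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}}$$"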
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from yellowbrick.classifier import ClassificationReport\n", "\n", "\n", "def visual_model_selection(X, y, estimator):\n", " \"\"\"\n", " Test various estimators.\n", " \"\"\"\n", " y = LabelEncoder().fit_transform(y.values.ravel())\n", " model = Pipeline([\n", " ('label_encoding', EncodeCategorical(X.keys())),\n", " ('one_hot_encoder', OneHotEncoder()),\n", " ('estimator', estimator)\n", " ])\n", "\n", " # Instantiate the classification model and visualizer\n", " visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])\n", " visualizer.fit(X, y)\n", " visualizer.score(X, y)\n", " visualizer.poof()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, LinearSVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, NuSVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, SVC())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, SGDClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, KNeighborsClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, LogisticRegressionCV())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, LogisticRegression())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, BaggingClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, ExtraTreesClassifier())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visual_model_selection(X, y, RandomForestClassifier())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion Matrices\n", "\n", "The `ConfusionMatrix` visualizer is a `ScoreVisualizer` that takes a fitted scikit-learn classifier and a set of test X and y values and returns a report showing how each of the test values predicted classes compare to their actual classes. Data scientists use confusion matrices to understand which classes are most easily confused. These provide similar information as what is available in a `ClassificationReport`, but rather than top-level scores, they provide deeper insight into the classification of individual data points." 
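] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under the hood, the visualizer runs `predict` on the test data and tabulates the results with scikit-learn's `confusion_matrix` function. As a minimal sketch of that underlying computation, using a toy set of labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "# Toy example: true vs. predicted labels for a three-class problem\n", "y_true = [0, 0, 1, 1, 2, 2]\n", "y_pred = [0, 1, 1, 1, 2, 0]\n", "\n", "# Rows are actual classes, columns are predicted classes\n", "confusion_matrix(y_true, y_pred)"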
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.classifier import ConfusionMatrix\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.datasets import load_digits\n", "\n", "# We'll use the handwritten digits data set from scikit-learn.\n", "# Each feature of this dataset is an 8x8 pixel image of a handwritten number.\n", "# Digits.data converts these 64 pixels into a single array of features\n", "digits = load_digits()\n", "X = digits.data\n", "y = digits.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =0.2, random_state=11)\n", "\n", "model = LogisticRegression()\n", "\n", "#The ConfusionMatrix visualizer taxes a model\n", "cm = ConfusionMatrix(model, classes=[0,1,2,3,4,5,6,7,8,9])\n", "\n", "#Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model\n", "cm.fit(X_train, y_train)\n", "\n", "#To create the ConfusionMatrix, we need some test data. Score runs predict() on the data\n", "#and then creates the confusion_matrix from scikit learn.\n", "cm.score(X_test, y_test)\n", "\n", "#How did we do?\n", "cm.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.features.radviz import RadViz\n", "from yellowbrick.features.pcoords import ParallelCoordinates\n", "from yellowbrick.features.pca import PCADecomposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**RadViz** is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle then plots points on the interior of the circle such that the point normalizes its values on the axes from the center to each arc. This mechanism allows as many dimensions as will easily fit on a circle, greatly expanding the dimensionality of the visualization.\n", "\n", "Data scientists use this method to detect separability between classes. E.g. is there an opportunity to learn from the feature set or is there just too much noise?\n", "\n", "This example uses experimental data for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute." 
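] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the geometry concrete, here is a minimal NumPy sketch of the standard RadViz projection (a hand-rolled illustration on a hypothetical toy matrix `X_toy`, not Yellowbrick's internal code): min-max normalize each feature, place one anchor per feature evenly around the unit circle, and position each instance at the average of the anchors weighted by its normalized feature values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical toy data: 3 instances, 4 features\n", "X_toy = np.array([\n", "    [1.0, 0.0, 2.0, 5.0],\n", "    [0.0, 3.0, 1.0, 0.0],\n", "    [2.0, 1.0, 0.0, 2.5],\n", "])\n", "\n", "# Min-max normalize each feature to [0, 1]\n", "X_norm = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))\n", "\n", "# Place one anchor per feature evenly around the unit circle\n", "d = X_toy.shape[1]\n", "theta = 2 * np.pi * np.arange(d) / d\n", "anchors = np.column_stack([np.cos(theta), np.sin(theta)])\n", "\n", "# Each instance is the average of the anchors,\n", "# weighted by its normalized feature values\n", "(X_norm @ anchors) / X_norm.sum(axis=1, keepdims=True)"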
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the classification data set\n", "try:\n", " data = pd.read_csv('../data/occupancy.csv')\n", "except FileNotFoundError:\n", " data = pd.read_csv(DATA_URL + 'occupancy.csv')\n", "\n", "# Specify the features of interest and the classes of the target\n", "features = [\"temperature\", \"relative humidity\", \"light\", \"C02\", \"humidity\"]\n", "classes = ['unoccupied', 'occupied']\n", "\n", "# Extract the numpy arrays from the data frame\n", "X = data[features]\n", "y = data.occupancy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the visualizer\n", "from yellowbrick.features import RadViz\n", "\n", "# Instantiate the visualizer\n", "visualizer = RadViz(classes=classes, features=features)\n", "\n", "visualizer.fit(X, y) # Fit the data to the visualizer\n", "visualizer.transform(X) # Transform the data\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Parallel coordinates** is multi-dimensional feature visualization technique where the vertical axis is duplicated horizontally for each feature. Instances are displayed as a single line segment drawn from each vertical axes to the location representing their value for that feature. This allows many dimensions to be visualized at once.\n", "\n", "This technique can be used to detect clusters of instances that have similar classes, and to note features that have high variance or different distributions. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instantiate the visualizer\n", "visualizer = ParallelCoordinates(classes=classes, features=features, normalize='standard', sample = 0.1)\n", "\n", "visualizer.fit(X, y) # Fit the data to the visualizer\n", "visualizer.transform(X) # Transform the data\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By inspecting the visualization closely, we can see that the combination of transparency and overlap gives us the sense of groups of similar instances, sometimes referred to as *braids*. If there are distinct braids of different classes, it suggests that there is enough separability that a classification algorithm might be able to discern between each class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The PCA Decomposition visualizer utilizes principal component analysis to decompose high dimensional data into two or three dimensions so that each instance can be plotted in a scatter plot. The use of PCA means that the projected dataset can be analyzed along axes of principal variation and can be interpreted to determine if spherical distance metrics can be utilized." 
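] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The projection itself can be sketched directly with scikit-learn, which is roughly what the visualizer automates: standardize the features, project onto the first two principal components, and scatter plot the result, colored by class. This sketch reuses the occupancy `X` and `y` from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Standardize, then project onto the first two principal components\n", "X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))\n", "\n", "plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, alpha=0.5)\n", "plt.xlabel('First principal component')\n", "plt.ylabel('Second principal component');"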
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the classification data set\n", "try:\n", " data = pd.read_csv('../data/credit.csv')\n", "except FileNotFoundError:\n", " data = pd.read_csv(DATA_URL + 'credit.csv')\n", "\n", "# Specify the features of interest\n", "features = [\n", " 'limit', 'sex', 'edu', 'married', 'age', 'apr_delay', 'may_delay',\n", " 'jun_delay', 'jul_delay', 'aug_delay', 'sep_delay', 'apr_bill', 'may_bill',\n", " 'jun_bill', 'jul_bill', 'aug_bill', 'sep_bill', 'apr_pay', 'may_pay', 'jun_pay',\n", " 'jul_pay', 'aug_pay', 'sep_pay',\n", "]\n", "\n", "# Extract the numpy arrays from the data frame\n", "X = data[features]\n", "y = data.default" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a list of colors to assign to points in the plot\n", "colors = np.array(['r' if yi else 'b' for yi in y])\n", "\n", "visualizer = PCADecomposition(scale=True, center=False, color=colors)\n", "visualizer.fit_transform(X,y)\n", "visualizer.poof()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visualizer = PCADecomposition(scale=True, center=False, color=colors, proj_dim=3)\n", "visualizer.fit_transform(X,y)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The PCA projection can be enhanced to a biplot whose points are the projected instances and whose vectors represent the structure of the data in high dimensional space. By using the `proj_features=True` flag, vectors for each feature in the dataset are drawn on the scatter plot in the direction of the maximum variance for that feature. These structures can be used to analyze the importance of a feature to the decomposition or to find features of related variance for further analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/concrete.csv')\n", "target = \"strength\"\n", "features = [\n", " 'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'\n", "]\n", "\n", "# Extract the instance data and the target\n", "X = df[features]\n", "y = df[target]\n", "\n", "visualizer = PCADecomposition(scale=True, proj_features=True)\n", "visualizer.fit_transform(X, y)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ROCAUC\n", "\n", "A ROCAUC (Receiver Operating Characteristic/Area Under the Curve) plot allows the user to visualize the tradeoff between the classifier’s sensitivity and specificity.\n", "\n", "The Receiver Operating Characteristic (ROC) is a measure of a classifier’s predictive quality that compares and visualizes the tradeoff between the model’s sensitivity and specificity. When plotted, a ROC curve displays the true positive rate on the Y axis and the false positive rate on the X axis on both a global average and per-class basis. The ideal point is therefore the top-left corner of the plot: false positives are zero and true positives are one.\n", "\n", "This leads to another metric, area under the curve (AUC), which is a computation of the relationship between false positives and true positives. The higher the AUC, the better the model generally is. However, it is also important to inspect the “steepness” of the curve, as this describes the maximization of the true positive rate while minimizing the false positive rate." 
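] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quantities being plotted can be computed directly with scikit-learn's `roc_curve` and `roc_auc_score`; a minimal sketch on a hypothetical set of labels and scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve, roc_auc_score\n", "\n", "# Hypothetical true labels and predicted scores\n", "y_true = np.array([0, 0, 1, 1])\n", "y_score = np.array([0.1, 0.4, 0.35, 0.8])\n", "\n", "# False positive rate and true positive rate at each score threshold\n", "fpr, tpr, thresholds = roc_curve(y_true, y_score)\n", "print(fpr, tpr)\n", "\n", "# Area under the ROC curve\n", "roc_auc_score(y_true, y_score)"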
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the classification data set\n", "data = pd.read_csv('../data/occupancy.csv')\n", "\n", "# Specify the features of interest and the classes of the target\n", "features = [\"temperature\", \"relative humidity\", \"light\", \"C02\", \"humidity\"]\n", "classes = ['unoccupied', 'occupied']\n", "\n", "X = data[features]\n", "y = data.occupancy\n", "\n", "# Create the train and test data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.classifier import ROCAUC\n", "\n", "# Instantiate the classification model and visualizer\n", "logistic = LogisticRegression()\n", "visualizer = ROCAUC(logistic)\n", "\n", "visualizer.fit(X_train, y_train) # Fit the training data to the visualizer\n", "visualizer.score(X_test, y_test) # Evaluate the model on the test data\n", "g = visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating Class Balance \n", "\n", "One of the biggest challenges for classification models is an imbalance of classes in the training data. Severe class imbalances may be masked by relatively good F1 and accuracy scores – the classifier is simply guessing the majority class and not making any evaluation on the underrepresented class.\n", "\n", "There are several techniques for dealing with class imbalance such as stratified sampling, down sampling the majority class, weighting, etc. But before these actions can be taken, it is important to understand what the class balance is in the training data. The `ClassBalance` visualizer supports this by creating a bar chart of the support for each class, that is the frequency of the classes’ representation in the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the classification data set\n", "data = pd.read_csv('../data/occupancy.csv')\n", "\n", "# Specify the features of interest and the classes of the target\n", "features = [\"temperature\", \"relative humidity\", \"light\", \"C02\", \"humidity\"]\n", "classes = ['unoccupied', 'occupied']\n", "\n", "X = data[features]\n", "y = data.occupancy\n", "\n", "# Create the train and test data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.classifier import ClassBalance\n", "\n", "# Instantiate the classification model and visualizer\n", "visualizer = ClassBalance(labels=classes)\n", "\n", "visualizer.fit(y_train, y_test)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster Selection\n", "\n", "### Elbow method\n", "\n", "The `KElbowVisualizer` implements the “elbow” method to help select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.\n", "\n", "To demonstrate, in the following example the `KElbowVisualizer` fits the `KMeans` model for a range of K values from 2 to 10 on a sample two-dimensional dataset with 8 random clusters of points. 
When the model is fit with 8 clusters, we can see an “elbow” in the graph, which in this case we know to be the optimal number." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "\n", "# Make a dataset with 8 random blobs\n", "X, y = make_blobs(centers=8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer\n", "from sklearn.cluster import KMeans\n", "\n", "model = KElbowVisualizer(KMeans(), k=10)\n", "model.fit(X)\n", "model.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the scoring parameter metric is set to `distortion`, which computes the sum of squared distances from each point to its assigned center. However, two other metrics can be used with the `KElbowVisualizer`: `silhouette` and `calinski_harabaz`. The `silhouette` score calculates the mean silhouette coefficient of all samples, while the `calinski_harabaz` score computes the ratio of dispersion between and within clusters.\n", "\n", "The `KElbowVisualizer` also displays the time taken to train the clustering model for each value of K as a dashed green line, but it can be hidden by setting `timings=False`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Silhouette analysis\n", "\n", "Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters visually. This measure has a range of [-1, 1].\n", "\n", "- values near +1 indicate that the sample is far away from the neighboring clusters\n", "- values near 0 indicate that the sample is on or very close to the decision boundary between two neighboring clusters\n", "- values near -1 indicate that those samples might have been assigned to the wrong cluster\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `SilhouetteVisualizer` displays the silhouette coefficient for each sample on a per-cluster basis, visualizing which clusters are dense and which are not. This is particularly useful for determining cluster imbalance, or for selecting a value for K by comparing multiple visualizers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import MiniBatchKMeans\n", "\n", "# Instantiate the clustering model and visualizer\n", "model = MiniBatchKMeans(n_clusters=6)\n", "visualizer = SilhouetteVisualizer(model)\n", "\n", "visualizer.fit(X) # Fit the training data to the visualizer\n", "visualizer.poof() # Draw/show/poof the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Intercluster Distance Maps\n", "\n", "Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. Thus, the closer the centers are in the visualization, the closer they are in the original feature space. \n", "\n", "The clusters are sized according to a scoring metric, which by default is membership, i.e. the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that if two clusters overlap in the 2D space, it does not follow that they overlap in the original feature space."
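] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conceptually, such a map can be sketched by embedding the fitted cluster centers in two dimensions with multidimensional scaling (assumed here; recent Yellowbrick versions document MDS as the default embedding) and sizing each center by its membership. The sketch below reuses the blobs `X` from above; the actual visualizer follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import MDS\n", "\n", "# Fit the clustering model and embed the cluster centers in 2D\n", "model = KMeans(n_clusters=9).fit(X)\n", "centers_2d = MDS(n_components=2).fit_transform(model.cluster_centers_)\n", "\n", "# Size each center by the number of instances assigned to it\n", "sizes = np.bincount(model.labels_)\n", "plt.scatter(centers_2d[:, 0], centers_2d[:, 1], s=sizes * 20, alpha=0.5);"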
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.cluster import InterclusterDistance\n", "\n", "# Instantiate the clustering model and visualizer\n", "visualizer = InterclusterDistance(KMeans(9))\n", "\n", "visualizer.fit(X)\n", "visualizer.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## References\n", "\n", "- [YellowBrick Model Selection Tutorial](http://www.scikit-yb.org/en/latest/tutorial.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 2 }