{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Plot\n", "\n", "The `corr_plot` builder takes a dataframe (a pandas `DataFrame` or a plain Python `dict`) as input and \n", "builds a correlation plot.\n", "\n", "It lets you combine 'tile', 'point' and 'label' layers in a matrix of 'full', 'lower' or 'upper' type.\n", "\n", "A call to the terminal `build()` method creates the resulting 'plot' object. \n", "This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, like `+ ggtitle()`, `+ ggsize()` and so on.\n", "\n", "\n", "The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:15.544612Z", "iopub.status.busy": "2024-04-17T07:33:15.544467Z", "iopub.status.idle": "2024-04-17T07:33:15.874773Z", "shell.execute_reply": "2024-04-17T07:33:15.874452Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from lets_plot import *\n", "from lets_plot.bistro.corr import *\n", "\n", "LetsPlot.setup_html()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:15.887781Z", "iopub.status.busy": "2024-04-17T07:33:15.887642Z", "iopub.status.idle": "2024-04-17T07:33:16.024441Z", "shell.execute_reply": "2024-04-17T07:33:16.024259Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(234, 5)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " displ year cyl cty hwy\n", "0 1.8 1999 4 18 29\n", "1 1.8 1999 4 21 29\n", "2 2.0 2008 4 20 31\n", "3 2.0 2008 4 21 30\n", "4 2.8 1999 6 16 26" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_df = pd.read_csv('https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/mpg.csv')\\\n", " .drop(columns=['Unnamed: 0']).select_dtypes(include=np.number)\n", "print(mpg_df.shape)\n", "mpg_df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.025543Z", "iopub.status.busy": "2024-04-17T07:33:16.025464Z", "iopub.status.idle": "2024-04-17T07:33:16.027531Z", "shell.execute_reply": "2024-04-17T07:33:16.027339Z" } }, "outputs": [], "source": [ "def group(plots, width=400, height=300):\n", " \"\"\"\n", " Useful for this demo.\n", " \"\"\"\n", " bunch = GGBunch()\n", " for idx, p in enumerate(plots):\n", " x = (idx % 2) * width\n", " y = int(idx / 2) * height\n", " bunch.add_plot(p, x, y, width, height)\n", " \n", " return bunch " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining 'tile', 'point' and 'label' layers.\n", "\n", "When combining layers, `corr_plot` chooses an acceptable plot configuration by default." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.028642Z", "iopub.status.busy": "2024-04-17T07:33:16.028511Z", "iopub.status.idle": "2024-04-17T07:33:16.036643Z", "shell.execute_reply": "2024-04-17T07:33:16.036424Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group([\n", " corr_plot(mpg_df).tiles().build() + ggtitle(\"Tiles\"),\n", " corr_plot(mpg_df).points().build() + ggtitle(\"Points\"), \n", " corr_plot(mpg_df).tiles().labels().build() + ggtitle(\"Tiles and labels\"),\n", " corr_plot(mpg_df).points().labels().tiles().build() + ggtitle(\"Tiles, points and labels\")\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default plot configuration adapts to the changing options - compare 'Tiles and labels' plot above and below.\n", "\n", "You can also override the default plot configuration using the parameter 'type' - compare 'Tiles, points and labels' plot above and below." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.037769Z", "iopub.status.busy": "2024-04-17T07:33:16.037616Z", "iopub.status.idle": "2024-04-17T07:33:16.041641Z", "shell.execute_reply": "2024-04-17T07:33:16.041469Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group([\n", " corr_plot(mpg_df).tiles().labels(color=\"white\").build() + ggtitle(\"Tiles and labels\"),\n", " (corr_plot(mpg_df)\n", " .tiles(type=\"upper\")\n", " .points(type=\"lower\")\n", " .labels(type=\"full\").build() + ggtitle(\"Tiles, points and labels\"))\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customizing colors.\n", "\n", "Instead of the default blue-grey-red gradient you can define your own lower-middle-upper colors, or \n", "choose one of the available 'Brewer' diverging palettes.\n", "\n", "Let's create a gradient resembling one of Seaborn gradients." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.042722Z", "iopub.status.busy": "2024-04-17T07:33:16.042608Z", "iopub.status.idle": "2024-04-17T07:33:16.044413Z", "shell.execute_reply": "2024-04-17T07:33:16.044233Z" } }, "outputs": [], "source": [ "bld = corr_plot(mpg_df).points().labels().tiles()\n", "\n", "# Configure gradient resembling one of Seaborn gradients.\n", "gradient = (bld\n", " .palette_gradient(low='#417555', mid='#EDEDED', high='#963CA7')\n", " .build()) + ggtitle(\"Custom gradient\")\n", "\n", "# Configure Brewer 'BrBG' palette.\n", "brewer = (bld\n", " .palette_BrBG()\n", " .build()) + ggtitle(\"Brewer\")\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.045378Z", "iopub.status.busy": "2024-04-17T07:33:16.045259Z", "iopub.status.idle": "2024-04-17T07:33:16.048931Z", "shell.execute_reply": "2024-04-17T07:33:16.048755Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group([\n", " gradient,\n", " brewer\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation plot with a large number of variables in the dataset.\n", "\n", "The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables, 38 of which are numeric." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.049906Z", "iopub.status.busy": "2024-04-17T07:33:16.049795Z", "iopub.status.idle": "2024-04-17T07:33:16.526316Z", "shell.execute_reply": "2024-04-17T07:33:16.526128Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1460, 38)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt \\\n", "0 1 60 65.0 8450 7 5 2003 \n", "1 2 20 80.0 9600 6 8 1976 \n", "2 3 60 68.0 11250 7 5 2001 \n", "3 4 70 60.0 9550 7 5 1915 \n", "4 5 60 84.0 14260 8 5 2000 \n", "\n", " YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF \\\n", "0 2003 196.0 706 ... 0 61 \n", "1 1976 0.0 978 ... 298 0 \n", "2 2002 162.0 486 ... 0 42 \n", "3 1970 0.0 216 ... 0 35 \n", "4 2000 350.0 655 ... 192 84 \n", "\n", " EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold \\\n", "0 0 0 0 0 0 2 2008 \n", "1 0 0 0 0 0 5 2007 \n", "2 0 0 0 0 0 9 2008 \n", "3 272 0 0 0 0 2 2006 \n", "4 0 0 0 0 0 12 2008 \n", "\n", " SalePrice \n", "0 208500 \n", "1 181500 \n", "2 223500 \n", "3 140000 \n", "4 250000 \n", "\n", "[5 rows x 38 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_df = pd.read_csv(\"https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/Ames_house_prices_train.csv\")\\\n", " .select_dtypes(include=np.number)\n", "print(housing_df.shape)\n", "housing_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Correlation plot that shows all the correlations in this dataset is too large and barely useful. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.527393Z", "iopub.status.busy": "2024-04-17T07:33:16.527316Z", "iopub.status.idle": "2024-04-17T07:33:16.536625Z", "shell.execute_reply": "2024-04-17T07:33:16.536442Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr_plot(housing_df).tiles(type='lower').palette_BrBG().build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### The 'threshold' parameter.\n", "\n", "The 'threshold' parameter lets us specify a level of significance (a minimum absolute correlation value), below which variables are not shown." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.537733Z", "iopub.status.busy": "2024-04-17T07:33:16.537571Z", "iopub.status.idle": "2024-04-17T07:33:16.545005Z", "shell.execute_reply": "2024-04-17T07:33:16.544826Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(corr_plot(housing_df, threshold=.5).tiles(diag=False).palette_BrBG().build() \n", " + ggtitle(\"Threshold: 0.5\")\n", " + ggsize(550, 400))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's further increase our threshold in order to see only highly correlated variables.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:33:16.546146Z", "iopub.status.busy": "2024-04-17T07:33:16.545978Z", "iopub.status.idle": "2024-04-17T07:33:16.552398Z", "shell.execute_reply": "2024-04-17T07:33:16.552222Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(corr_plot(housing_df, threshold=.8)\n", " .tiles(diag=False)\n", " .palette_BrBG().build() \n", " + ggtitle(\"Threshold: 0.8\")\n", " + ggsize(550, 400))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }