{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Plot\n", "\n", "The `CorrPlot` builder takes a dataframe (Kotlin `Map<*, *>`) as the input and builds a correlation plot.\n", "\n", "If the input has NxN shape and contains only numbers in range [0..1], then it is plotted as is. Otherwise `CorrPlot` computes correlation coefficients using Pearson's method. \n", "\n", "`CorrPlot` lets you combine 'tile', 'point' or 'label' layers in a matrix of \"full\", \"lower\" or \"upper\" type.\n", "\n", "A call to the terminal `build()` method creates the resulting 'plot' object. \n", "This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, like `+ ggsize()` and so on.\n", "\n", "\n", "The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Lets-Plot Kotlin API v.4.1.1. Frontend: Notebook with dynamically loaded JS. Lets-Plot JS v.2.5.1." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%useLatestDescriptors\n", "%use lets-plot\n", "\n", "LetsPlot.getInfo() // This prevents Krangl from loading an obsolete version of Lets-Plot classes." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%use krangl" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
Shape: 3 x 12. \n", "
" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// Cars MPG dataset\n", "var mpg_df = DataFrame.readCSV(\"https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv\")\n", "mpg_df.head(3)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
Shape: 3 x 11. \n", "
" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_df = mpg_df.remove(\"\")\n", "mpg_df.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "val mpg_dat = mpg_df.toMap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining 'tile', 'point' and 'label' layers.\n", "\n", "When combining layers, `CorrPlot` chooses an acceptable plot configuration by default." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles\").tiles().build(),\n", " CorrPlot(mpg_dat, \"Points\").points().build(), \n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels().build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\").points().labels().tiles().build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default plot configuration adapts to the changing options - compare \"Tiles and labels\" plot above and below.\n", "\n", "You can also override the default plot configuration using the parameter `type` - compare \"Tiles, points and labels\" plot above and below." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels(color=\"white\").build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\")\n", " .tiles(type=\"upper\")\n", " .points(type=\"lower\")\n", " .labels(type=\"full\").build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customizing colors.\n", "\n", "Instead of the default blue-grey-red gradient you can define your own lower-middle-upper colors, or \n", "choose one of the available 'Brewer' diverging palettes.\n", "\n", "Let's create a gradient resembling one of the Seaborn gradients." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val corrPlot = CorrPlot(mpg_dat).points().labels().tiles()\n", "\n", "// Configure a gradient resembling one of the Seaborn gradients.\n", "val withGradientColors = (corrPlot\n", " .paletteGradient(low=\"#417555\", mid=\"#EDEDED\", high=\"#963CA7\")\n", " .build()) + ggtitle(\"Custom gradient\")\n", "\n", "// Configure Brewer 'Spectral' palette.\n", "val withBrewerColors = (corrPlot\n", " .paletteSpectral()\n", " .build()) + ggtitle(\"Brewer 'Spectral'\")\n", "\n", "// Show both plots\n", "gggrid(listOf(withGradientColors, withBrewerColors), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Correlation plot with a large number of variables in the dataset.\n", "\n", "The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Pave | null | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | null | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | null | null | null | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80 | 9600 | Pave | null | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | null | null | null | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68 | 11250 | Pave | null | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | null | null | null | 0 | 9 | 2008 | WD | Normal | 223500 |
Shape: 3 x 81. \n", "
" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val housing_df = DataFrame.readCSV(\"../data/Ames_house_prices_train.csv\")\n", "housing_df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Correlation plot that shows all the correlations in this dataset is too large and barely useful. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap())\n", " .tiles(type=\"lower\")\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### The `threshold` parameter.\n", "\n", "The `threshold` parameter let us specify a level of significance, below which variables are not shown." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.5\", threshold = 0.5, adjustSize = 0.7)\n", " .tiles(type=\"full\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's further increase our threshold in order to see only highly correlated variables.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.8\", threshold = 0.8)\n", " .tiles(diag=false)\n", " .labels(color=\"white\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] } ], "metadata": { "kernelspec": { "display_name": "Kotlin", "language": "kotlin", "name": "kotlin" }, "language_info": { "codemirror_mode": "text/x-kotlin", "file_extension": ".kt", "mimetype": 
"text/x-kotlin", "name": "kotlin", "nbconvert_exporter": "", "pygments_lexer": "kotlin", "version": "1.7.20-dev-1299" } }, "nbformat": 4, "nbformat_minor": 4 }