{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Plot\n", "\n", "The `CorrPlot` builder takes a dataframe (Kotlin `Map<*, *>`) as input and builds a correlation plot.\n", "\n", "If the input has an NxN shape and contains only numbers in the range [0..1], then it is plotted as is. Otherwise `CorrPlot` will compute correlation coefficients using Pearson's method. \n", "\n", "`CorrPlot` lets you combine 'tile', 'point' or 'label' layers in a matrix of \"full\", \"lower\" or \"upper\" type.\n", "\n", "A call to the terminal `build()` method creates the resulting 'plot' object. \n", "This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, for example `+ ggsize()`.\n", "\n", "\n", "The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Lets-Plot Kotlin API v.4.1.1. Frontend: Notebook with dynamically loaded JS. Lets-Plot JS v.2.5.1." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%useLatestDescriptors\n", "%use lets-plot\n", "\n", "LetsPlot.getInfo() // This prevents Krangl from loading an obsolete version of Lets-Plot classes." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%use krangl" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
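| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |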

Shape: 3 x 12. \n", "

" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// Cars MPG dataset\n", "var mpg_df = DataFrame.readCSV(\"https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv\")\n", "mpg_df.head(3)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
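| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |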

Shape: 3 x 11. \n", "

" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_df = mpg_df.remove(\"\")\n", "mpg_df.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "val mpg_dat = mpg_df.toMap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining 'tile', 'point' and 'label' layers.\n", "\n", "When combining layers, `CorrPlot` chooses an acceptable plot configuration by default." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles\").tiles().build(),\n", " CorrPlot(mpg_dat, \"Points\").points().build(),\n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels().build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\").points().labels().tiles().build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default plot configuration adapts to the changing options: compare the \"Tiles and labels\" plots above and below.\n", "\n", "You can also override the default plot configuration using the `type` parameter: compare the \"Tiles, points and labels\" plots above and below." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels(color=\"white\").build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\")\n", " .tiles(type=\"upper\")\n", " .points(type=\"lower\")\n", " .labels(type=\"full\").build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customizing colors.\n", "\n", "Instead of the default blue-grey-red gradient, you can define your own lower-middle-upper colors or \n", "choose one of the available 'Brewer' diverging palettes.\n", "\n", "Let's create a gradient resembling one of the Seaborn gradients." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val corrPlot = CorrPlot(mpg_dat).points().labels().tiles()\n", "\n", "// Configure a gradient resembling one of the Seaborn gradients.\n", "val withGradientColors = (corrPlot\n", " .paletteGradient(low=\"#417555\", mid=\"#EDEDED\", high=\"#963CA7\")\n", " .build()) + ggtitle(\"Custom gradient\")\n", "\n", "// Configure the Brewer 'Spectral' palette.\n", "val withBrewerColors = (corrPlot\n", " .paletteSpectral()\n", " .build()) + ggtitle(\"Brewer 'Spectral'\")\n", "\n", "// Show both plots\n", "gggrid(listOf(withGradientColors, withBrewerColors), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Correlation plot with a large number of variables in the dataset.\n", "\n", "The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
IdMSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilitiesLotConfigLandSlopeNeighborhoodCondition1Condition2BldgTypeHouseStyleOverallQualOverallCondYearBuiltYearRemodAddRoofStyleRoofMatlExterior1stExterior2ndMasVnrTypeMasVnrAreaExterQualExterCondFoundationBsmtQualBsmtCondBsmtExposureBsmtFinType1BsmtFinSF1BsmtFinType2BsmtFinSF2BsmtUnfSFTotalBsmtSFHeatingHeatingQCCentralAirElectrical1stFlrSF2ndFlrSFLowQualFinSFGrLivAreaBsmtFullBathBsmtHalfBathFullBathHalfBathBedroomAbvGrKitchenAbvGrKitchenQualTotRmsAbvGrdFunctionalFireplacesFireplaceQuGarageTypeGarageYrBltGarageFinishGarageCarsGarageAreaGarageQualGarageCondPavedDriveWoodDeckSFOpenPorchSFEnclosedPorch3SsnPorchScreenPorchPoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
160RL658450PavenullRegLvlAllPubInsideGtlCollgCrNormNorm1Fam2Story7520032003GableCompShgVinylSdVinylSdBrkFace196GdTAPConcGdTANoGLQ706Unf0150856GasAExYSBrkr85685401710102131Gd8Typ0nullAttchd2003RFn2548TATAY0610000nullnullnull022008WDNormal208500
220RL809600PavenullRegLvlAllPubFR2GtlVeenkerFeedrNorm1Fam1Story6819761976GableCompShgMetalSdMetalSdNone0TATACBlockGdTAGdALQ978Unf02841262GasAExYSBrkr1262001262012031TA6Typ1TAAttchd1976RFn2460TATAY29800000nullnullnull052007WDNormal181500
360RL6811250PavenullIR1LvlAllPubInsideGtlCollgCrNormNorm1Fam2Story7520012002GableCompShgVinylSdVinylSdBrkFace162GdTAPConcGdTAMnGLQ486Unf0434920GasAExYSBrkr92086601786102131Gd6Typ1TAAttchd2001RFn2608TATAY0420000nullnullnull092008WDNormal223500

Shape: 3 x 81. \n", "

" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val housing_df = DataFrame.readCSV(\"../data/Ames_house_prices_train.csv\")\n", "housing_df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Correlation plot that shows all the correlations in this dataset is too large and barely useful. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap())\n", " .tiles(type=\"lower\")\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### The `threshold` parameter.\n", "\n", "The `threshold` parameter lets us specify a minimum absolute correlation value, below which correlations are not shown." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.5\", threshold = 0.5, adjustSize = 0.7)\n", " .tiles(type=\"full\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's further increase our threshold in order to see only highly correlated variables.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.8\", threshold = 0.8)\n", " .tiles(diag=false)\n", " .labels(color=\"white\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] } ], "metadata": { "kernelspec": { "display_name": "Kotlin", "language": "kotlin", "name": "kotlin" }, "language_info": { "codemirror_mode": "text/x-kotlin", "file_extension": ".kt", "mimetype": "text/x-kotlin", "name": "kotlin", "nbconvert_exporter": "", "pygments_lexer": "kotlin", "version": "1.7.20-dev-1299" } }, "nbformat": 4, "nbformat_minor": 4 }