{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Plot\n", "\n", "The `CorrPlot` builder takes a dataframe (Kotlin `Map<*, *>`) as the input and builds a correlation plot.\n", "\n", "If the input has NxN shape and contains only numbers in range [0..1], then it is plotted as is. Otherwise `CorrPlot` will compute correlation coefficients using Pearson's method. \n", "\n", "`CorrPlot` lets you combine 'tile', 'point' or 'label' layers in a matrix of \"full\", \"lower\" or \"upper\" type.\n", "\n", "A call to the terminal `build()` method creates the resulting 'plot' object. \n", "This 'plot' object can be further refined using the regular Lets-Plot (ggplot) API, like `+ ggsize()` and so on.\n", "\n", "\n", "The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Lets-Plot Kotlin API v.4.1.1. Frontend: Notebook with dynamically loaded JS. Lets-Plot JS v.2.5.1." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%useLatestDescriptors\n", "%use lets-plot\n", "\n", "LetsPlot.getInfo() // This prevents Krangl from loading an obsolete version of Lets-Plot classes." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%use krangl" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

Shape: 3 x 12. \n", "

" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// Cars MPG dataset\n", "var mpg_df = DataFrame.readCSV(\"https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv\")\n", "mpg_df.head(3)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

Shape: 3 x 11. \n", "

" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_df = mpg_df.remove(\"\")\n", "mpg_df.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "val mpg_dat = mpg_df.toMap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining 'tile', 'point' and 'label' layers.\n", "\n", "When combining layers, `CorrPlot` chooses an acceptable plot configuration by default." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles\").tiles().build(),\n", " CorrPlot(mpg_dat, \"Points\").points().build(), \n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels().build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\").points().labels().tiles().build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default plot configuration adapts to the changing options - compare the \"Tiles and labels\" plots above and below.\n", "\n", "You can also override the default plot configuration using the `type` parameter - compare the \"Tiles, points and labels\" plots above and below." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gggrid(\n", " listOf(\n", " CorrPlot(mpg_dat, \"Tiles and labels\").tiles().labels(color=\"white\").build(),\n", " CorrPlot(mpg_dat, \"Tiles, points and labels\")\n", " .tiles(type=\"upper\")\n", " .points(type=\"lower\")\n", " .labels(type=\"full\").build()\n", " ), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customizing colors.\n", "\n", "Instead of the default blue-grey-red gradient, you can define your own lower-middle-upper colors, or \n", "choose one of the available 'Brewer' diverging palettes.\n", "\n", "Let's create a gradient resembling one of the Seaborn gradients." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val corrPlot = CorrPlot(mpg_dat).points().labels().tiles()\n", "\n", "// Configure gradient resembling one of the Seaborn gradients.\n", "val withGradientColors = (corrPlot\n", " .paletteGradient(low=\"#417555\", mid=\"#EDEDED\", high=\"#963CA7\")\n", " .build()) + ggtitle(\"Custom gradient\")\n", "\n", "// Configure Brewer 'Spectral' palette.\n", "val withBrewerColors = (corrPlot\n", " .paletteSpectral()\n", " .build()) + ggtitle(\"Brewer 'Spectral'\")\n", "\n", "// Show both plots\n", "gggrid(listOf(withGradientColors, withBrewerColors), 2, 400, 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Correlation plot with a large number of variables in the dataset.\n", "\n", "The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
1	60	RL	65	8450	Pave	null	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	null	Attchd	2003	RFn	2	548	TA	TA	Y	0	61	null	null	null	2	2008	WD	Normal	208500
2	20	RL	80	9600	Pave	null	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976	RFn	2	460	TA	TA	Y	298	0	null	null	null	5	2007	WD	Normal	181500
3	60	RL	68	11250	Pave	null	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001	RFn	2	608	TA	TA	Y	0	42	null	null	null	9	2008	WD	Normal	223500

Shape: 3 x 81. \n", "

" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val housing_df = DataFrame.readCSV(\"../data/Ames_house_prices_train.csv\")\n", "housing_df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "A correlation plot that shows all the correlations in this dataset is too large to be useful. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap())\n", " .tiles(type=\"lower\")\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### The `threshold` parameter.\n", "\n", "The `threshold` parameter lets us specify a minimum absolute correlation value; correlations below it are not shown." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.5\", threshold = 0.5, adjustSize = 0.7)\n", " .tiles(type=\"full\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's further increase our threshold in order to see only highly correlated variables.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", " " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CorrPlot(housing_df.toMap(), \"Threshold: 0.8\", threshold = 0.8)\n", " .tiles(diag=false)\n", " .labels(color=\"white\", diag=false)\n", " .paletteBrBG()\n", " .build()" ] } ], "metadata": { "kernelspec": { "display_name": "Kotlin", "language": "kotlin", "name": "kotlin" }, "language_info": { "codemirror_mode": "text/x-kotlin", "file_extension": ".kt", "mimetype": "text/x-kotlin", "name": "kotlin", "nbconvert_exporter": "", "pygments_lexer": "kotlin", "version": "1.7.20-dev-1299" } }, "nbformat": 4, "nbformat_minor": 4 }