## Correlation Plot

The `CorrPlot` builder takes a dataframe (Kotlin `Map<*, *>`) as the input and builds a correlation plot.

If the input has NxN shape and contains only numbers in the range [-1..1], it is plotted as is. Otherwise `CorrPlot` computes the correlation coefficients using Pearson's method.
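
To make the fallback concrete, here is a minimal pure-Kotlin sketch of Pearson's coefficient and of a pairwise matrix built from it. It is an illustration only, not the code `CorrPlot` runs: the names `pearson` and `correlationMatrix` are hypothetical, the columns are assumed to be equally long lists of `Double`, and missing values are not handled.

```kotlin
import kotlin.math.sqrt

// Pearson correlation coefficient of two equally sized numeric series.
// Returns NaN when either series has zero variance.
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.isNotEmpty()) { "series must be non-empty and of equal length" }
    val meanX = xs.average()
    val meanY = ys.average()
    var cov = 0.0
    var varX = 0.0
    var varY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / sqrt(varX * varY)
}

// Hypothetical helper: the coefficient for every ordered pair of columns,
// keyed by (first column name, second column name).
fun correlationMatrix(columns: Map<String, List<Double>>): Map<Pair<String, String>, Double> {
    val result = LinkedHashMap<Pair<String, String>, Double>()
    for ((a, xs) in columns) {
        for ((b, ys) in columns) {
            result[a to b] = pearson(xs, ys)
        }
    }
    return result
}
```

Every value produced this way lies in [-1..1] (or is NaN for a constant column), which is why an NxN input already in that range can be treated as a ready-made correlation matrix.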

`CorrPlot` lets you combine 'tile', 'point' and 'label' layers in a matrix of "full", "lower" or "upper" type.
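
The three matrix types can be pictured as different selections of cells from the same NxN grid. The sketch below is a hypothetical pure-Kotlin helper (`cellsFor` is not part of Lets-Plot) and it assumes that "lower" means row index greater than column index and "upper" the opposite; the on-screen orientation that `CorrPlot` uses may differ. The `diag` flag mirrors the `diag` parameter used later in this notebook.

```kotlin
// Cells (row, column) of an n x n correlation matrix kept by each matrix type.
// Assumption: "lower" keeps row > column, "upper" keeps row < column;
// diagonal cells are kept only when diag is true.
fun cellsFor(type: String, n: Int, diag: Boolean = true): List<Pair<Int, Int>> {
    require(type in setOf("full", "lower", "upper")) { "unknown matrix type: $type" }
    val cells = mutableListOf<Pair<Int, Int>>()
    for (row in 0 until n) {
        for (col in 0 until n) {
            val keep = when {
                row == col -> diag
                type == "lower" -> row > col
                type == "upper" -> row < col
                else -> true // "full" keeps every off-diagonal cell
            }
            if (keep) cells.add(row to col)
        }
    }
    return cells
}
```

For a 3x3 matrix, `cellsFor("full", 3)` yields all nine cells, `cellsFor("lower", 3)` yields six, and `cellsFor("lower", 3, diag = false)` yields three.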

A call to the terminal `build()` method creates the resulting 'plot' object.
This 'plot' object can be refined further with the regular Lets-Plot (ggplot) API, for example `+ ggsize()`.


The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle.

In [1]:
%useLatestDescriptors
%use lets-plot

LetsPlot.getInfo()  // This prevents Krangl from loading an obsolete version of Lets-Plot classes.

Lets-Plot Kotlin API v.4.1.1. Frontend: Notebook with dynamically loaded JS. Lets-Plot JS v.2.5.1.

In [2]:
%use krangl

In [3]:
// Cars MPG dataset
var mpg_df = DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpg_df.head(3)


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact


In [4]:
mpg_df = mpg_df.remove("")
mpg_df.head(3)

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact


In [5]:
val mpg_dat = mpg_df.toMap()
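
Because `build()` returns a regular plot object, it can be combined with other Lets-Plot features through `+`. A minimal sketch, assuming the cells above have been run (so `mpg_dat` is defined and `%use lets-plot` has imported the API); the name `sizedPlot`, the title and the size values are arbitrary choices for illustration:

```kotlin
// Build the correlation plot first, then refine it like any other Lets-Plot plot.
val sizedPlot = CorrPlot(mpg_dat, "Tiles").tiles().build() + ggsize(500, 400)
sizedPlot
```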

### Combining 'tile', 'point' and 'label' layers.

When combining layers, `CorrPlot` chooses an acceptable plot configuration by default.

In [6]:
gggrid(
    listOf(
        CorrPlot(mpg_dat, "Tiles").tiles().build(),
        CorrPlot(mpg_dat, "Points").points().build(), 
        CorrPlot(mpg_dat, "Tiles and labels").tiles().labels().build(),
        CorrPlot(mpg_dat, "Tiles, points and labels").points().labels().tiles().build()
    ), 2, 400, 320)

The default plot configuration adapts to the changing options - compare "Tiles and labels" plot above and below.

You can also override the default plot configuration using the parameter `type` - compare "Tiles, points and labels" plot above and below.

In [7]:
gggrid(
    listOf(
        CorrPlot(mpg_dat, "Tiles and labels").tiles().labels(color="white").build(),
        CorrPlot(mpg_dat, "Tiles, points and labels")
         .tiles(type="upper")
         .points(type="lower")
         .labels(type="full").build()
    ), 2, 400, 320)

### Customizing colors.

Instead of the default blue-grey-red gradient you can define your own lower-middle-upper colors, or 
choose one of the available 'Brewer' diverging palettes.

Let's create a gradient resembling one of Seaborn gradients.

In [8]:
val corrPlot = CorrPlot(mpg_dat).points().labels().tiles()

// Configure gradient resembling one of Seaborn gradients.
val withGradientColors = (corrPlot
            .paletteGradient(low="#417555", mid="#EDEDED", high="#963CA7")
            .build()) + ggtitle("Custom gradient")

// Configure Brewer 'Spectral' palette.
val withBrewerColors = (corrPlot
            .paletteSpectral()
            .build()) + ggtitle("Brewer 'Spectral'")

// Show both plots
gggrid(listOf(withGradientColors, withBrewerColors), 2, 400, 320)


### Correlation plot with a large number of variables in the dataset.

The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables.

In [9]:
val housing_df = DataFrame.readCSV("../data/Ames_house_prices_train.csv")
housing_df.head(3)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500



A correlation plot that shows all the correlations in this dataset is too large to be of much use.

In [10]:
CorrPlot(housing_df.toMap())
    .tiles(type="lower")
    .paletteBrBG()
    .build()


#### The `threshold` parameter.

The `threshold` parameter lets us set a minimal absolute correlation value, below which variables are not shown.

In [11]:
CorrPlot(housing_df.toMap(), "Threshold: 0.5", threshold = 0.5, adjustSize = 0.7)
    .tiles(type="full", diag=false)
    .paletteBrBG()
    .build()
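
One way to picture the effect of `threshold` is as a filter over the correlation matrix. The following pure-Kotlin sketch rests on assumptions: `variablesAboveThreshold` is a hypothetical helper, a variable is kept when at least one of its off-diagonal coefficients has an absolute value at or above the threshold, and the exact comparison used inside `CorrPlot` may differ.

```kotlin
import kotlin.math.abs

// Keeps the variables that have at least one off-diagonal coefficient
// with |r| >= threshold. NaN coefficients never qualify.
fun variablesAboveThreshold(
    corr: Map<Pair<String, String>, Double>,
    threshold: Double
): Set<String> =
    corr.filter { (pair, r) -> pair.first != pair.second && !r.isNaN() && abs(r) >= threshold }
        .keys
        .flatMap { listOf(it.first, it.second) }
        .toSet()
```

The absolute value matters: a strong negative correlation such as -0.85 passes a threshold of 0.8 just as 0.85 does.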


Let's further increase our threshold in order to see only highly correlated variables.


In [12]:
CorrPlot(housing_df.toMap(), "Threshold: 0.8", threshold = 0.8)
    .tiles(diag=false)
    .labels(color="white", diag=false)
    .paletteBrBG()
    .build()