{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2D Kernel Density Distributions Using Plotly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### ABOUT THE AUTHOR:\n",
"This notebook was contributed by [Plotly user Emilia Petrisor](https://plotly.com/~empet). You can follow Emilia on Twitter [@mathinpython](https://twitter.com/mathinpython) or [GitHub](https://github.com/empet)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have two Excel files, each with two columns. We read the files into two `pandas` dataframes and plot,\n",
"for each of them, an estimate of the joint distribution of its two columns. The joint distribution is estimated with the [`scipy.stats.gaussian_kde`](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.gaussian_kde.html) class."
]
},
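{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how `scipy.stats.gaussian_kde` can be used, the cell below fits a 2D kernel density estimate to synthetic data (an assumption made for illustration; it is not the Excel data analyzed in this notebook), evaluates it on a regular grid, and checks that the estimated density integrates to approximately 1. The names `sample`, `kernel` and `total_mass` are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.stats as st\n",
"\n",
"# Synthetic, illustrative data (NOT the Excel data used in this notebook)\n",
"rng = np.random.RandomState(0)\n",
"sample = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)\n",
"\n",
"# gaussian_kde expects an array of shape (n_dims, n_points)\n",
"kernel = st.gaussian_kde(sample.T)\n",
"\n",
"# Evaluate the estimated density on a regular grid\n",
"x = np.linspace(-6, 6, 121)\n",
"y = np.linspace(-6, 6, 121)\n",
"X, Y = np.meshgrid(x, y)\n",
"Z = kernel(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)\n",
"\n",
"# A density should integrate to about 1 over a grid that covers the data\n",
"dx = x[1] - x[0]\n",
"dy = y[1] - y[0]\n",
"total_mass = Z.sum() * dx * dy\n",
"print(total_mass)"
]
},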
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import scipy.stats as st\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the first file:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'multiannual', u'bachelor-th'], dtype='object')"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xl = pd.ExcelFile(\"Data/CSCEng.xls\")\n",
"dfc = xl.parse(\"Sheet1\")\n",
"dfc.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the second one:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'multiannual', u'bachelor-th'], dtype='object')"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xl = pd.ExcelFile(\"Data/SystEng.xls\")\n",
"dfi = xl.parse(\"Sheet1\")\n",
"dfi.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The contour plot of the joint distribution of two variables (columns) is colored with a custom colorscale: "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cubehelix_cs=[[0.0, '#fcf9f7'],\n",
" [0.16666666666666666, '#edcfc9'],\n",
" [0.3333333333333333, '#daa2ac'],\n",
" [0.5, '#bc7897'],\n",
" [0.6666666666666666, '#925684'],\n",
" [0.8333333333333333, '#5f3868'],\n",
" [1.0, '#2d1e3e']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `kde_scipy` returns the data for a Plotly contour plot of the estimated 2D distribution:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def kde_scipy(vals1, vals2, x_range, y_range, N):\n",
"    # vals1, vals2 are the values of two variables (columns).\n",
"    # x_range = (a, b) is the grid interval on the x axis, along which vals2 is evaluated;\n",
"    # y_range = (c, d) is the grid interval on the y axis, along which vals1 is evaluated.\n",
"    # Each interval is usually larger than the (min, max) range of its data.\n",
"    a, b = x_range\n",
"    c, d = y_range\n",
"\n",
"    x = np.linspace(a, b, N)\n",
"    y = np.linspace(c, d, N)\n",
"    X, Y = np.meshgrid(x, y)\n",
"    # the rows of positions (Y, X) are in the same order as the rows of values (vals1, vals2)\n",
"    positions = np.vstack([Y.ravel(), X.ravel()])\n",
"\n",
"    values = np.vstack([vals1, vals2])\n",
"    kernel = st.gaussian_kde(values)\n",
"    Z = np.reshape(kernel(positions).T, X.shape)\n",
"\n",
"    return [x, y, Z]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Contour plot of the joint distribution of data from the first file ###"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import plotly.plotly as py\n",
"from plotly.graph_objs import * "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def make_kdeplot(varX, varY, x_range, y_range, N, colorsc, title):\n",
"    # varX, varY are lists, 1d numpy arrays, or dataframe columns storing the values of two variables\n",
"    # x_range = (a, b) and y_range = (c, d) are the axis ranges for varX and varY\n",
"    a, b = x_range\n",
"    c, d = y_range\n",
"\n",
"    # kde_scipy takes the variable plotted on the y axis first\n",
"    x, y, Z = kde_scipy(varY, varX, (a, b), (c, d), N)\n",
"\n",
"    data = [\n",
"        Contour(\n",
"            z=Z,\n",
"            x=x,\n",
"            y=y,\n",
"            colorscale=colorsc,\n",
"            #reversescale=True,\n",
"            opacity=0.9,\n",
"            contours=dict(showlines=False)\n",
"        )\n",
"    ]\n",
"\n",
"    layout = Layout(\n",
"        title=title,\n",
"        font=dict(family='Georgia, serif', color='#635F5D'),\n",
"        showlegend=False,\n",
"        autosize=False,\n",
"        width=650,\n",
"        height=650,\n",
"        xaxis=dict(range=[a, b], showgrid=False, nticks=7),\n",
"        yaxis=dict(range=[c, d], showgrid=False, nticks=7),\n",
"        margin=dict(l=40, r=40, b=85, t=100)\n",
"    )\n",
"\n",
"    return Figure(data=data, layout=layout)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N=200\n",
"a,b=(5,11)\n",
"fig=make_kdeplot(dfc['multiannual'], dfc['bachelor-th'], (a,b), (a,b), \n",
" N, cubehelix_cs,'kde plot of two sets of data' )\n",
"\n",
"py.sign_in('empet', 'my_api_key')\n",
"py.iplot(fig, filename='kde-2D-CSCE')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Contour plot of the joint distribution of data from the second file ###"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a, b=(4,12)\n",
"fig=make_kdeplot(dfi['multiannual'], dfi['bachelor-th'], (a,b), (a,b),\n",
" N, cubehelix_cs, 'kde plot of two sets of data')\n",
"py.iplot(fig, filename='kde-2D-SE')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the second contour plot illustrates a [mixture of two bivariate distributions](https://en.wikipedia.org/wiki/Mixture_distribution)."
]
},
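{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate how such a mixture shows up in a kernel density estimate, here is a small sketch on synthetic data (the means, covariance and sample sizes are assumptions chosen for illustration, not values taken from the files above). It draws points from two bivariate normal distributions and compares the estimated density at the two component means with the density halfway between them. A clearly lower value at the midpoint is what one would expect to see as two separate peaks in a contour plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.stats as st\n",
"\n",
"# Synthetic two-component mixture (illustrative assumption, not the notebook data)\n",
"rng = np.random.RandomState(1)\n",
"cov = np.eye(2)\n",
"mean1 = np.array([0.0, 0.0])\n",
"mean2 = np.array([6.0, 6.0])\n",
"sample = np.vstack([rng.multivariate_normal(mean1, cov, size=300),\n",
"                    rng.multivariate_normal(mean2, cov, size=300)])\n",
"\n",
"# Fit the 2D kernel density estimate; the input has shape (2, n_points)\n",
"kernel = st.gaussian_kde(sample.T)\n",
"\n",
"# Compare the density at the component means with the density halfway between them\n",
"midpoint = (mean1 + mean2) / 2.0\n",
"density_mean1 = kernel(mean1.reshape(2, 1))[0]\n",
"density_mean2 = kernel(mean2.reshape(2, 1))[0]\n",
"density_mid = kernel(midpoint.reshape(2, 1))[0]\n",
"print(density_mean1, density_mean2, density_mid)"
]
},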
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we read a dataframe from a CSV file posted on Plotly's GitHub account, select the rows corresponding to `Iris-virginica`, and plot the joint distribution of two virginica features:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')\n",
"virginica = df.loc[df.Name == \"Iris-virginica\"]\n",
"a, b=(5,8.5)\n",
"c,d=(2,4)\n",
"N=100\n",
"fig=make_kdeplot(virginica.SepalLength, virginica.SepalWidth, (a,b), (c,d),\n",
" N, cubehelix_cs, 'kde plot of joint distribution for virginica SepalLength and SepalWidth')\n",
"py.iplot(fig, filename='virginica-sepal-length-vs-width')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}