{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Python Tour of Data Science: Data Visualization\n", "\n", "[Michaël Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise\n", "\n", "Data visualization is a key aspect of exploratory data analysis.\n", "During this exercise we'll gradually build more and more complex visualizations. We'll do this by replicating plots. Try to reproduce not only the lines but also the axis labels, legends, and titles.\n", "\n", "* Goal of data visualization: clearly and efficiently communicate information through visual representations. While tables are generally used to look up a specific measurement, charts are used to show patterns or relationships.\n", "* Means: mainly statistical graphics for exploratory analysis, e.g. scatter plots, histograms, probability plots, box plots, residual plots, but also [infographics](https://en.wikipedia.org/wiki/Infographic) for communication.\n", "\n", "*Data visualization is both an art and a science. It should combine both aesthetic form and functionality.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Time series\n", "\n", "To start slowly, let's make a static line plot from some time series. Reproduce the plots below using:\n", "1. The procedural API of [matplotlib](http://matplotlib.org), the main data visualization library for Python. Its procedural API is similar to MATLAB's and convenient for interactive work.\n", "2. [Pandas](http://pandas.pydata.org), which wraps matplotlib around its DataFrame format and makes many standard plots easy to code. It offers many [helpers for data visualization](http://pandas.pydata.org/pandas-docs/version/0.19.1/visualization.html).\n", "\n", "**Hint**: to plot with pandas, you first need to create a DataFrame, pandas' tabular data format."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Random time series.\n", "n = 1000\n", "rs = np.random.RandomState(42)\n", "data = rs.randn(n, 4).cumsum(axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(15,5))\n", "plt.plot(data[:, 0], label='A')\n", "plt.plot(data[:, 1], '.-k', label='B')\n", "plt.plot(data[:, 2], '--m', label='C')\n", "plt.plot(data[:, 3], ':', label='D')\n", "plt.legend(loc='upper left')\n", "plt.xticks(range(0, n, 50))\n", "plt.ylabel('Value')\n", "plt.xlabel('Day')\n", "plt.grid()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "idx = pd.date_range('1/1/2000', periods=n)\n", "df = pd.DataFrame(data, index=idx, columns=list('ABCD'))\n", "df.plot(figsize=(15,5));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Categories\n", "\n", "Categorical data is best represented by [bar](https://en.wikipedia.org/wiki/Bar_chart) or [pie](https://en.wikipedia.org/wiki/Pie_chart) charts. Reproduce the plots below using the object-oriented API of matplotlib, which is recommended for programming.\n", "\n", "**Question**: What are the pros and cons of each plot?\n", "\n", "**Tip**: the [matplotlib gallery](http://matplotlib.org/gallery.html) is a convenient starting point."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = [10, 40, 25, 15, 10]\n", "categories = list('ABCDE')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n", "\n", "axes[1].pie(data, explode=[0,.1,0,0,0], labels=categories, autopct='%1.1f%%', startangle=90)\n", "axes[1].axis('equal')\n", "\n", "pos = range(len(data))\n", "axes[0].bar(pos, data, align='center')\n", "axes[0].set_xticks(pos)\n", "axes[0].set_xticklabels(categories)\n", "axes[0].set_xlabel('Category')\n", "axes[0].set_title('Allotment');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Frequency\n", "\n", "A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a [histogram](https://en.wikipedia.org/wiki/Histogram) or a [box plot](https://en.wikipedia.org/wiki/Box_plot).\n", "\n", "Reproduce the plots with the following three libraries, which provide high-level declarative syntax for statistical visualization as well as a convenient interface to pandas:\n", "* [Seaborn](http://seaborn.pydata.org) is a statistical visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its advantage is that you can modify the produced plots with matplotlib, so you lose nothing.\n", "* [ggplot](http://ggplot.yhathq.com) is a (partial) port of the popular [ggplot2](http://ggplot2.org) for R. It has its roots in the influential book [The Grammar of Graphics](https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html). Convenient if you know ggplot2 already.\n", "* [Vega](https://vega.github.io/) is a declarative format for statistical visualization based on [D3.js](https://d3js.org), a low-level JavaScript library for interactive visualization. 
[Vincent](https://vincent.readthedocs.io/en/latest/) (discontinued) and [altair](https://altair-viz.github.io/) are Python interfaces to Vega. Altair is quite new and does not provide all the needed functionality yet, but it is promising!\n", "\n", "**Hints**:\n", "* Seaborn: look at `distplot()` and `boxplot()`.\n", "* ggplot: we are interested in the [geom_histogram](http://ggplot.yhathq.com/docs/geom_histogram.html) geometry." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import seaborn as sns\n", "import os\n", "df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n", "\n", "g = sns.distplot(df['petal_width'], kde=True, rug=False, ax=axes[0])\n", "g.set(title='Distribution of petal width')\n", "\n", "g = sns.boxplot(x='species', y='petal_width', data=df, ax=axes[1])\n", "g.set(title='Distribution of petal width by species');" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import ggplot\n", "\n", "ggplot.ggplot(df, ggplot.aes(x='petal_width', fill='species')) + \\\n", " ggplot.geom_histogram() + \\\n", " ggplot.ggtitle('Distribution of Petal Width by Species')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import altair\n", "\n", "altair.Chart(df).mark_bar(opacity=.75).encode(\n", " x=altair.X('petal_width', bin=altair.Bin(maxbins=30)),\n", " y='count(*)',\n", " color=altair.Color('species')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Correlation\n", "\n", "[Scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are widely used to assess the correlation between two variables. 
Pair plots are then a useful way of displaying the pairwise relations between variables in a dataset.\n", "\n", "Use the seaborn `pairplot()` function to analyze how separable the iris dataset is." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sns.pairplot(df, hue=\"species\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 Dimensionality reduction\n", "\n", "Humans can only comprehend up to 3 spatial dimensions (further variables can be encoded with e.g. color or size), so [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods:\n", "1. The classical [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).\n", "2. [t-distributed stochastic neighbor embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).\n", "\n", "**Hints**:\n", "* t-SNE is a stochastic method, so you may want to run it multiple times.\n", "* The easiest way to create the scatter plot is to add columns to the pandas DataFrame, then use the Seaborn `swarmplot()`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.manifold import TSNE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pca = PCA(n_components=2)\n", "X = pca.fit_transform(df.values[:, :4])\n", "df['pca1'] = X[:, 0]\n", "df['pca2'] = X[:, 1]\n", "\n", "tsne = TSNE(n_components=2)\n", "X = tsne.fit_transform(df.values[:, :4])\n", "df['tsne1'] = X[:, 0]\n", "df['tsne2'] = X[:, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n", "sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])\n", "sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6 Interactive visualization\n", "\n", "For interactive visualization, look at [bokeh](http://bokeh.pydata.org) (we used it during the [data exploration exercise](http://nbviewer.jupyter.org/github/mdeff/ntds_2016/blob/with_outputs/toolkit/01_demo_acquisition_exploration.ipynb#4-Interactive-Visualization)) or [VisPy](http://vispy.org)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7 Geographic map\n", "\n", "If you want to visualize data on an interactive map, look at [Folium](https://github.com/python-visualization/folium)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }