{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "###
We will use the popular open source library Plotly for data visualization.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you will see how to draw interactive graphics easily and with a minimum of code. Plotly is an extremely powerful tool and it is impossible to cover all its features at once, so I'll show you how to build the most relevant and interesting graphics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Pie \n", "- Bar \n", "- Scatter \n", "- Box \n", "- Choropleth " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, if you don't already have Plotly installed, run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.offline as py\n", "import pycountry\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "py.init_notebook_mode(connected=True)\n", "import plotly.graph_objs as go" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PATH = \"120-years-of-olympic-history-athletes-and-results/athlete_events.csv\"\n", "data = pd.read_csv(PATH)\n", "data.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's download the file `athlete_events.csv` from [Kaggle page](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results). The dataset has the following features:\n", "\n", "- __ID__ - Unique number for each athlete\n", "- __Name__ - Athlete's name\n", "- __Sex__ - M or F\n", "- __Age__ - Integer\n", "- __Height__ - In centimeters\n", "- __Weight__ - In kilograms\n", "- __Team__ - Team name\n", "- __NOC__ - National Olympic Committee 3-letter code\n", "- __Games__ - Year and season\n", "- __Year__ - Integer\n", "- __Season__ - Summer or Winter\n", "- __City__ - Host city\n", "- __Sport__ - Sport\n", "- __Event__ - Event\n", "- __Medal__ - Gold, Silver, Bronze, or NA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Pie " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's draw the simplest graph possible. Display the percentages of three types of medals among the total number of medals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colors = [\"#f4cb42\", \"#cd7f32\", \"#a1a8b5\"] # gold,bronze,silver\n", "medal_counts = data.Medal.value_counts(sort=True)\n", "labels = medal_counts.index\n", "values = medal_counts.values\n", "\n", "pie = go.Pie(labels=labels, values=values, marker=dict(colors=colors))\n", "layout = go.Layout(title=\"Medal distribution\")\n", "fig = go.Figure(data=[pie], layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the main classes for drawing graphs are located in plotly.graph_objs as go:\n", "\n", "- __go.Pie__ is a graph object with any of the named arguments or attributes listed below. \n", "- __go.Layout__ allows you to customize axis labels, titles, fonts, sizes, margins, colors, and more to define the appearance of the chart.\n", "- __go.Figure__ just creates the final object to be plotted, and simply just creates a dictionary-like object that contains both the data object and the layout object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, that was too easy. Let's complicate the chart a bit, we will use two Pie on one chart.\n", "\n", "We will display the top 10 countries whose athletes win any medals. Separate for men and women." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topn = 10\n", "male = data[data.Sex == \"M\"]\n", "female = data[data.Sex == \"F\"]\n", "count_male = male.dropna().NOC.value_counts()[:topn].reset_index()\n", "count_female = female.dropna().NOC.value_counts()[:topn].reset_index()\n", "\n", "pie_men = go.Pie(\n", " labels=count_male[\"index\"],\n", " values=count_male.NOC,\n", " name=\"Men\",\n", " hole=0.4,\n", " domain={\"x\": [0, 0.46]},\n", ")\n", "pie_women = go.Pie(\n", " labels=count_female[\"index\"],\n", " values=count_female.NOC,\n", " name=\"Women\",\n", " hole=0.4,\n", " domain={\"x\": [0.5, 1]},\n", ")\n", "\n", "layout = dict(\n", " title=\"Top-10 countries with medals by sex\",\n", " font=dict(size=15),\n", " legend=dict(orientation=\"h\"),\n", " annotations=[\n", " dict(x=0.2, y=0.5, text=\"Men\", showarrow=False, font=dict(size=20)),\n", " dict(x=0.8, y=0.5, text=\"Women\", showarrow=False, font=dict(size=20)),\n", " ],\n", ")\n", "\n", "fig = dict(data=[pie_men, pie_women], layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Parameter __hole__ sets the size of the hole in the center of the pie\n", "- Parameter __domain__ sets the offset. The X array set the horizontal position whilst the Y array sets the vertical. For example, x: [0,0.5], y: [0, 0.5] would mean the bottom left position of the plot.\n", "- Dict __annotations__ sets the format of the text inside the Pie.\n", "- To learn more, read the go.Pie documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Bar " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we can not do without the Bar. \n", "\n", "Let's draw a bar chart of the number of sports in different years." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "games = data[data.Season == \"Summer\"].Games.unique()\n", "games.sort()\n", "sport_counts = np.array(\n", " [data[data.Games == game].groupby(\"Sport\").size().shape[0] for game in games]\n", ")\n", "bar = go.Bar(\n", " x=games,\n", " y=sport_counts,\n", " marker=dict(color=sport_counts, colorscale=\"Reds\", showscale=True),\n", ")\n", "layout = go.Layout(title=\"Number of sports in the summer Olympics by year\")\n", "fig = go.Figure(data=[bar], layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole rendering scheme is the same, now the base class is __go.Bar__.\n", "\n", "- Dictionary __marker__ sets the drawing style of the chart and allows you to display the color scale\n", "- To learn more, read the go.Bar documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, let's complicate the graph and display the number of different medals for the top 10 countries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topn = 10\n", "top10 = data.dropna().NOC.value_counts()[:topn]\n", "\n", "gold = data[data.Medal == \"Gold\"].NOC.value_counts()\n", "gold = gold[top10.index]\n", "silver = data[data.Medal == \"Silver\"].NOC.value_counts()\n", "silver = silver[top10.index]\n", "bronze = data[data.Medal == \"Bronze\"].NOC.value_counts()\n", "bronze = bronze[top10.index]\n", "\n", "bar_gold = go.Bar(x=gold.index, y=gold, name=\"Gold\", marker=dict(color=\"#f4cb42\"))\n", "bar_silver = go.Bar(\n", " x=silver.index, y=silver, name=\"Silver\", marker=dict(color=\"#a1a8b5\")\n", ")\n", "bar_bronze = go.Bar(\n", " x=bronze.index, y=bronze, name=\"Bronze\", marker=dict(color=\"#cd7f32\")\n", ")\n", "\n", "layout = go.Layout(\n", " title=\"Top-10 countries with medals\", yaxis=dict(title=\"Count of medals\")\n", ")\n", "\n", "fig = go.Figure(data=[bar_gold, bar_silver, bar_bronze], layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Scatter " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's draw a beautiful scatter plot showing average height and weight for athletes from different sports.\n", "\n", "We will make circles of different sizes depending on the popularity of the sport and, as a result, the sample size of athletes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmp = data.groupby([\"Sport\"])[\"Height\", \"Weight\"].agg(\"mean\").dropna()\n", "df1 = pd.DataFrame(tmp).reset_index()\n", "tmp = data.groupby([\"Sport\"])[\"ID\"].count()\n", "df2 = pd.DataFrame(tmp).reset_index()\n", "dataset = df1.merge(df2) # DataFrame with columns 'Sport', 'Height', 'Weight', 'ID'\n", "\n", "scatterplots = list()\n", "for sport in dataset[\"Sport\"]:\n", " df = dataset[dataset[\"Sport\"] == sport]\n", " trace = go.Scatter(\n", " x=df[\"Height\"],\n", " y=df[\"Weight\"],\n", " name=sport,\n", " marker=dict(symbol=\"circle\", sizemode=\"area\", sizeref=10, size=df[\"ID\"]),\n", " )\n", " scatterplots.append(trace)\n", "\n", "layout = go.Layout(\n", " title=\"Mean height and weight by sport\",\n", " xaxis=dict(title=\"Height, cm\"),\n", " yaxis=dict(title=\"Weight, kg\"),\n", " showlegend=True,\n", ")\n", "\n", "fig = dict(data=scatterplots, layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It was beautiful, wasn't it? We can interactively remove the sport we are interested in, zoom in and analyze the charts in every possible way.\n", "\n", "- Dictionary __marker__ again defines the drawing view, sets the shape type (try, for example, square), dimensions, and more. The possibilities are almost endless.\n", "- To learn more, read the go.Scatter documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Box " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will display statistics on age for men and women participating in the Olympics with Boxplot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "men = data[data.Sex == \"M\"].Age\n", "women = data[data.Sex == \"F\"].Age\n", "\n", "box_m = go.Box(x=men, name=\"Male\", fillcolor=\"navy\")\n", "box_w = go.Box(x=women, name=\"Female\", fillcolor=\"lime\")\n", "layout = go.Layout(title=\"Age by sex\")\n", "fig = go.Figure(data=[box_m, box_w], layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- This graph describes the distribution of the data. The center vertical line corresponds to the median, and the boundaries of the rectangle correspond to the first and third quartiles. The points show the outliers. In addition, you can see the minimum and maximum values.\n", "- Who do you think is the __youngest__ (10 y.o.) and the __oldest__ (97 y.o.) participant of the Olympics? Find them :)\n", "- To learn more, read the go.Box documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Choropleth " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's determine how many participants were sent by different countries during the whole period of the Olympics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install pycountry" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_name(code):\n", " \"\"\"\n", " Translate code to name of the country\n", " \"\"\"\n", " try:\n", " name = pycountry.countries.get(alpha_3=code).name\n", " except:\n", " name = code\n", " return name\n", "\n", "\n", "country_number = pd.DataFrame(data.NOC.value_counts())\n", "country_number[\"country\"] = country_number.index\n", "country_number.columns = [\"number\", \"country\"]\n", "country_number.reset_index().drop(columns=[\"index\"], inplace=True)\n", "country_number[\"country\"] = country_number[\"country\"].apply(lambda c: get_name(c))\n", "country_number.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "worldmap = [\n", " dict(\n", " type=\"choropleth\",\n", " locations=country_number[\"country\"],\n", " locationmode=\"country names\",\n", " z=country_number[\"number\"],\n", " autocolorscale=True,\n", " reversescale=False,\n", " marker=dict(line=dict(color=\"rgb(180,180,180)\", width=0.5)),\n", " colorbar=dict(autotick=False, title=\"Number of athletes\"),\n", " )\n", "]\n", "\n", "layout = dict(\n", " title=\"The Nationality of Athletes\",\n", " geo=dict(showframe=False, showcoastlines=True, projection=dict(type=\"Mercator\")),\n", ")\n", "\n", "fig = dict(data=worldmap, layout=layout)\n", "py.iplot(fig, validate=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To learn more, read the go.Box documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it, you've learned how plotly works and mastered simple but beautiful interactive graphics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Useful links\n", "\n", "- Plotly website\n", "- Plotly for beginners\n", "- Plotly tutorials" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }