{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "Author: [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/). Translated and edited by Alena Sharlo, [Yury Kashnitsky](https://yorko.github.io), [Artem Trunov](https://www.linkedin.com/in/datamove), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic 2. Visual data analysis in Python\n", "## Part 2. Overview of Seaborn, Matplotlib and Plotly libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Article outline\n", "\n", "1. [Dataset](#1.-Dataset)\n", "2. [DataFrame.plot()](#2.-DataFrame.plot)\n", "3. [Seaborn](#3.-Seaborn)\n", "4. [Plotly](#4.-Plotly)\n", "5. [Demo assignment](#5.-Demo-assignment)\n", "6. [Useful resources](#6.-Useful-resources)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Dataset\n", "\n", "First, we will set up our environment by importing all necessary libraries. We will also change the display settings to better show plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Matplotlib forms the basis for visualization in Python\n", "import matplotlib.pyplot as plt\n", "# We will use the Seaborn library\n", "import seaborn as sns\n", "\n", "sns.set()\n", "\n", "# Graphics in SVG format are sharper and more legible\n", "%config InlineBackend.figure_format = 'svg'\n", "\n", "# Increase the default plot size and set the color scheme\n", "plt.rcParams[\"figure.figsize\"] = (8, 5)\n", "plt.rcParams[\"image.cmap\"] = \"viridis\"\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let’s load the dataset that we will be using into a `DataFrame`. I have picked a dataset on video game sales and ratings from [Kaggle Datasets](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings).\n", "Some of the games in this dataset lack ratings, so let’s keep only the examples that have all of their values present." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../../data/video_games_sales.csv\").dropna()\n", "print(df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, print the summary of the `DataFrame` to check data types and to verify everything is non-null." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that `pandas` has loaded some of the numerical features as `object` type. We will explicitly convert those columns into `float` and `int`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"User_Score\"] = df[\"User_Score\"].astype(\"float64\")\n", "df[\"Year_of_Release\"] = df[\"Year_of_Release\"].astype(\"int64\")\n", "df[\"User_Count\"] = df[\"User_Count\"].astype(\"int64\")\n", "df[\"Critic_Count\"] = df[\"Critic_Count\"].astype(\"int64\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting `DataFrame` contains 6825 examples and 16 columns. Let’s look at the first few entries with the `head()` method to check that everything has been parsed correctly. To make it more convenient, I have listed only the variables that we will use in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "useful_cols = [\n", " \"Name\",\n", " \"Platform\",\n", " \"Year_of_Release\",\n", " \"Genre\",\n", " \"Global_Sales\",\n", " \"Critic_Score\",\n", " \"Critic_Count\",\n", " \"User_Score\",\n", " \"User_Count\",\n", " \"Rating\",\n", "]\n", "df[useful_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. DataFrame.plot()\n", "\n", "Before we turn to Seaborn and Plotly, let’s discuss the simplest and often most convenient way to visualize data from a `DataFrame`: using its own `plot()` method.\n", "\n", "As an example, we will create a plot of video game sales by region and year. First, let’s keep only the columns we need. Then, we will calculate the total sales by year and call the `plot()` method on the resulting `DataFrame`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n", " \"Year_of_Release\"\n", ").sum().plot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the implementation of the `plot()` method in `pandas` is based on `matplotlib`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `kind` parameter, you can change the type of the plot to, for example, a *bar chart*. `matplotlib` is generally quite flexible for customizing plots. You can change almost everything in the chart, but you may need to dig into the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) to find the corresponding parameters. For example, the parameter `rot` is responsible for the rotation angle of ticks on the x-axis (for vertical plots):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n", " \"Year_of_Release\"\n", ").sum().plot(kind=\"bar\", rot=45);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Seaborn\n", "\n", "Now, let's move on to the `Seaborn` library. `seaborn` is essentially a higher-level API based on the `matplotlib` library. Among other things, it differs from the latter in that it contains more sensible default settings for plotting. By adding `import seaborn as sns; sns.set()` to your code, your plots will look much nicer. Also, this library contains a set of complex tools for visualization that would otherwise (i.e. when using bare `matplotlib`) require quite a large amount of code.\n", "\n", "#### pairplot()\n", "\n", "Let's take a look at the first of such complex plots, a *pairwise relationships plot*, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# `pairplot()` may become very slow with the SVG format\n", "%config InlineBackend.figure_format = 'png'\n", "sns.pairplot(\n", " df[[\"Global_Sales\", \"Critic_Score\", \"Critic_Count\", \"User_Score\", \"User_Count\"]]\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.\n", "\n", "#### distplot()\n", "\n", "It is also possible to plot a distribution of observations with `seaborn`'s `distplot()`. For example, let's look at the distribution of critics' ratings: `Critic_Score`. By default, the plot displays a histogram and the [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation). Note that `distplot()` has been deprecated since `seaborn` 0.11 in favor of `histplot()` and `displot()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'svg'\n", "sns.distplot(df[\"Critic_Score\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### jointplot()\n", "\n", "To look more closely at the relationship between two numerical variables, you can use a *joint plot*, which is a cross between a scatter plot and a histogram. Let's see how the `Critic_Score` and `User_Score` features are related." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.jointplot(x=\"Critic_Score\", y=\"User_Score\", data=df, kind=\"scatter\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### boxplot()\n", "\n", "Another useful type of plot is a *box plot*. Let's compare critics' ratings for the top 5 biggest gaming platforms." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_platforms = (\n", " df[\"Platform\"].value_counts().sort_values(ascending=False).head(5).index.values\n", ")\n", "sns.boxplot(\n", " y=\"Platform\",\n", " x=\"Critic_Score\",\n", " data=df[df[\"Platform\"].isin(top_platforms)],\n", " orient=\"h\",\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is worth spending a bit more time to discuss how to interpret a box plot. Its components are a *box* (obviously, this is why it is called a *box plot*), the so-called *whiskers*, and a number of individual points (*outliers*).\n", "\n", "The box by itself illustrates the interquartile spread of the distribution; its length is determined by the $25\\% \\, (\\text{Q1})$ and $75\\% \\, (\\text{Q3})$ percentiles. The vertical line inside the box marks the median ($50\\%$) of the distribution. \n", "\n", "The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval $(\\text{Q1} - 1.5 \\cdot \\text{IQR}, \\text{Q3} + 1.5 \\cdot \\text{IQR})$, where $\\text{IQR} = \\text{Q3} - \\text{Q1}$ is the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).\n", "\n", "Outliers that fall outside the range bounded by the whiskers are plotted individually.\n", "\n", "#### heatmap()\n", "\n", "The last type of plot that we will cover here is a *heat map*. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "platform_genre_sales = (\n", " df.pivot_table(\n", " index=\"Platform\", columns=\"Genre\", values=\"Global_Sales\", aggfunc=\"sum\"\n", " )\n", " .fillna(0)\n", " .astype(float)\n", ")\n", "sns.heatmap(platform_genre_sales, annot=True, fmt=\".1f\", linewidths=0.5);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Plotly\n", "\n", "We have examined some visualization tools based on the `matplotlib` library. However, this is not the only option for plotting in `Python`. Let’s take a look at the `plotly` library. Plotly is an open-source library that lets you create interactive plots within a Jupyter notebook without having to use JavaScript.\n", "\n", "The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in onto a specific part of the plot, etc.\n", "\n", "Before we start, let’s import all the necessary modules and initialize `plotly` by calling the `init_notebook_mode()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly\n", "import plotly.graph_objs as go\n", "from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot\n", "\n", "init_notebook_mode(connected=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Line plot\n", "\n", "First of all, let’s build a *line plot* showing the number of games released and their sales by year." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "years_df = (\n", " df.groupby(\"Year_of_Release\")[[\"Global_Sales\"]]\n", " .sum()\n", " .join(df.groupby(\"Year_of_Release\")[[\"Name\"]].count())\n", ")\n", "years_df.columns = [\"Global_Sales\", \"Number_of_Games\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Figure` is the main class and the workhorse of visualization in `plotly`. It consists of the data (an array of lines called `traces` in this library) and the style (represented by the `layout` object). In the simplest case, you may pass only the `traces` to the `iplot` function.\n", "\n", "The `show_link` parameter toggles the visibility of the links leading to the online platform `plot.ly` in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing `show_link=False` to prevent accidental clicks on those links." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a line (trace) for the global sales\n", "trace0 = go.Scatter(x=years_df.index, y=years_df[\"Global_Sales\"], name=\"Global Sales\")\n", "\n", "# Create a line (trace) for the number of games released\n", "trace1 = go.Scatter(\n", " x=years_df.index, y=years_df[\"Number_of_Games\"], name=\"Number of games released\"\n", ")\n", "\n", "# Define the data array\n", "data = [trace0, trace1]\n", "\n", "# Set the title\n", "layout = {\"title\": \"Statistics for video games\"}\n", "\n", "# Create a Figure and plot it\n", "fig = go.Figure(data=data, layout=layout)\n", "iplot(fig, show_link=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an option, you can save the plot in an HTML file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotly.offline.plot(fig, filename=\"years_stats.html\", show_link=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bar chart\n", "\n", "Let's use a *bar chart* to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Do calculations and prepare the dataset\n", "platforms_df = (\n", " df.groupby(\"Platform\")[[\"Global_Sales\"]]\n", " .sum()\n", " .join(df.groupby(\"Platform\")[[\"Name\"]].count())\n", ")\n", "platforms_df.columns = [\"Global_Sales\", \"Number_of_Games\"]\n", "platforms_df.sort_values(\"Global_Sales\", ascending=False, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a bar for the global sales\n", "trace0 = go.Bar(\n", " x=platforms_df.index, y=platforms_df[\"Global_Sales\"], name=\"Global Sales\"\n", ")\n", "\n", "# Create a bar for the number of games released\n", "trace1 = go.Bar(\n", " x=platforms_df.index,\n", " y=platforms_df[\"Number_of_Games\"],\n", " name=\"Number of games released\",\n", ")\n", "\n", "# Combine the data and style objects\n", "data = [trace0, trace1]\n", "layout = {\"title\": \"Market share by gaming platform\"}\n", "\n", "# Create a `Figure` and plot it\n", "fig = go.Figure(data=data, layout=layout)\n", "iplot(fig, show_link=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Box plot\n", "\n", "`plotly` also supports *box plots*. Let’s consider the distribution of critics' ratings by the genre of the game." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = []\n", "\n", "# Create a box trace for each genre in our dataset\n", "for genre in df.Genre.unique():\n", " data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))\n", "\n", "# Visualize\n", "iplot(data, show_link=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `plotly`, you can also create other types of visualization. Even with default settings, the plots look quite nice. Additionally, the library makes it easy to modify various parameters: colors, fonts, captions, annotations, and so on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Demo assignment\n", "To practice with visual data analysis, you can complete [this assignment](https://www.kaggle.com/kashnitsky/a2-demo-analyzing-cardiovascular-data) where you'll be analyzing cardiovascular disease data.\n", "\n", "## 6. Useful resources\n", "- The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-part-2-seaborn-and-plotly)\n", "- [\"Plotly for interactive plots\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/plotly_tutorial_for_interactive_plots_sankovalev.ipynb) - a tutorial by Alexander Kovalev within mlcourse.ai (full list of tutorials is [here](https://mlcourse.ai/tutorials))\n", "- [\"Bring your plots to life with Matplotlib animations\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/bring_your_plots_to_life_with_matplotlib_animations_kyriacos_kyriacou.ipynb) - a tutorial by Kyriacos Kyriacou within mlcourse.ai\n", "- [\"Some details on Matplotlib\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/some_details_in_matplotlib_pisarev_ivan.ipynb) - a tutorial by Ivan Pisarev within mlcourse.ai\n", "- Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)\n", "- Medium [\"story\"](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd) based on this notebook\n", "- Course materials as a [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse)\n", "- If you read Russian: an [article](https://habrahabr.ru/company/ods/blog/323210/) on Habrahabr with roughly the same material, and a [lecture](https://youtu.be/vm63p8Od0bM) on YouTube\n", "- Here is the official documentation for the libraries we used: [`matplotlib`](https://matplotlib.org/contents.html), [`seaborn`](https://seaborn.pydata.org/introduction.html) and [`pandas`](https://pandas.pydata.org/pandas-docs/stable/).\n", "- The [gallery](http://seaborn.pydata.org/examples/index.html) of sample charts created with `seaborn` is a very good resource.\n", "- Also, see the [documentation](http://scikit-learn.org/stable/modules/manifold.html) on Manifold Learning in `scikit-learn`.\n", "- Efficient t-SNE implementation [Multicore-TSNE](https://github.com/DmitryUlyanov/Multicore-TSNE).\n", "- \"How to Use t-SNE Effectively\", [Distill.pub](https://distill.pub/2016/misread-tsne/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }