\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/). Translated and edited by Alena Sharlo, [Yury Kashnitsky](https://yorko.github.io), [Artem Trunov](https://www.linkedin.com/in/datamove), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
Topic 2. Visual data analysis in Python\n",
"##
Part 2. Overview of Seaborn, Matplotlib and Plotly libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Article outline\n",
"\n",
"1. [Dataset](#1.-Dataset)\n",
"2. [DataFrame.plot()](#2.-DataFrame.plot)\n",
"3. [Seaborn](#3.-Seaborn)\n",
"4. [Plotly](#4.-Plotly)\n",
"5. [Demo assignment](#5.-Demo-assignment)\n",
"6. [Useful resources](#6.-Useful-resources)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Dataset\n",
"\n",
"First, we will set up our environment by importing all necessary libraries. We will also change the display settings to better show plots."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Matplotlib forms basis for visualization in Python\n",
"import matplotlib.pyplot as plt\n",
"# We will use the Seaborn library\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"\n",
"# Graphics in SVG format are more sharp and legible\n",
"%config InlineBackend.figure_format = 'svg'\n",
"\n",
"# Increase the default plot size and set the color scheme\n",
"plt.rcParams[\"figure.figsize\"] = (8, 5)\n",
"plt.rcParams[\"image.cmap\"] = \"viridis\"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let’s load the dataset that we will be using into a `DataFrame`. I have picked a dataset on video game sales and ratings from [Kaggle Datasets](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings).\n",
"Some of the games in this dataset lack ratings; so, let’s filter for only those examples that have all of their values present."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../../data/video_games_sales.csv\").dropna()\n",
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, print the summary of the `DataFrame` to check data types and to verify everything is non-null."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that `pandas` has loaded some of the numerical features as `object` type. We will explicitly convert those columns into `float` and `int`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[\"User_Score\"] = df[\"User_Score\"].astype(\"float64\")\n",
"df[\"Year_of_Release\"] = df[\"Year_of_Release\"].astype(\"int64\")\n",
"df[\"User_Count\"] = df[\"User_Count\"].astype(\"int64\")\n",
"df[\"Critic_Count\"] = df[\"Critic_Count\"].astype(\"int64\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting `DataFrame` contains 6825 examples and 16 columns. Let’s look at the first few entries with the `head()` method to check that everything has been parsed correctly. To make it more convenient, I have listed only the variables that we will use in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"useful_cols = [\n",
" \"Name\",\n",
" \"Platform\",\n",
" \"Year_of_Release\",\n",
" \"Genre\",\n",
" \"Global_Sales\",\n",
" \"Critic_Score\",\n",
" \"Critic_Count\",\n",
" \"User_Score\",\n",
" \"User_Count\",\n",
" \"Rating\",\n",
"]\n",
"df[useful_cols].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. DataFrame.plot()\n",
"\n",
"Before we turn to Seaborn and Plotly, let’s discuss the simplest and often most convenient way to visualize data from a `DataFrame`: using its own `plot()` method.\n",
"\n",
"As an example, we will create a plot of video game sales by country and year. First, let’s keep only the columns we need. Then, we will calculate the total sales by year and call the `plot()` method on the resulting `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n",
" \"Year_of_Release\"\n",
").sum().plot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the implementation of the `plot()` method in `pandas` is based on `matplotlib`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `kind` parameter, you can change the type of the plot to, for example, a *bar chart*. `matplotlib` is generally quite flexible for customizing plots. You can change almost everything in the chart, but you may need to dig into the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) to find the corresponding parameters. For example, the parameter `rot` is responsible for the rotation angle of ticks on the x-axis (for vertical plots):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n",
" \"Year_of_Release\"\n",
").sum().plot(kind=\"bar\", rot=45);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Seaborn\n",
"\n",
"Now, let's move on to the `Seaborn` library. `seaborn` is essentially a higher-level API based on the `matplotlib` library. Among other things, it differs from the latter in that it contains more adequate default settings for plotting. By adding `import seaborn as sns; sns.set()` in your code, the images of your plots will become much nicer. Also, this library contains a set of complex tools for visualization that would otherwise (i.e. when using bare `matplotlib`) require quite a large amount of code.\n",
"\n",
"#### pairplot()\n",
"\n",
"Let's take a look at the first of such complex plots, a *pairwise relationships plot*, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# `pairplot()` may become very slow with the SVG format\n",
"%config InlineBackend.figure_format = 'png'\n",
"sns.pairplot(\n",
" df[[\"Global_Sales\", \"Critic_Score\", \"Critic_Count\", \"User_Score\", \"User_Count\"]]\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.\n",
"\n",
"#### distplot()\n",
"\n",
"It is also possible to plot a distribution of observations with `seaborn`'s `distplot()`. For example, let's look at the distribution of critics' ratings: `Critic_Score`. By default, the plot displays a histogram and the [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%config InlineBackend.figure_format = 'svg'\n",
"sns.distplot(df[\"Critic_Score\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### jointplot()\n",
"\n",
"To look more closely at the relationship between two numerical variables, you can use *joint plot*, which is a cross between a scatter plot and histogram. Let's see how the `Critic_Score` and `User_Score` features are related."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.jointplot(x=\"Critic_Score\", y=\"User_Score\", data=df, kind=\"scatter\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### boxplot()\n",
"\n",
"Another useful type of plot is a *box plot*. Let's compare critics' ratings for the top 5 biggest gaming platforms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"top_platforms = (\n",
" df[\"Platform\"].value_counts().sort_values(ascending=False).head(5).index.values\n",
")\n",
"sns.boxplot(\n",
" y=\"Platform\",\n",
" x=\"Critic_Score\",\n",
" data=df[df[\"Platform\"].isin(top_platforms)],\n",
" orient=\"h\",\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is worth spending a bit more time to discuss how to interpret a box plot. Its components are a *box* (obviously, this is why it is called a *box plot*), the so-called *whiskers*, and a number of individual points (*outliers*).\n",
"\n",
"The box by itself illustrates the interquartile spread of the distribution; its length determined by the $25\\% \\, (\\text{Q1})$ and $75\\% \\, (\\text{Q3})$ percentiles. The vertical line inside the box marks the median ($50\\%$) of the distribution. \n",
"\n",
"The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval $(\\text{Q1} - 1.5 \\cdot \\text{IQR}, \\text{Q3} + 1.5 \\cdot \\text{IQR})$, where $\\text{IQR} = \\text{Q3} - \\text{Q1}$ is the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).\n",
"\n",
"Outliers that fall out of the range bounded by the whiskers are plotted individually.\n",
"\n",
"#### heatmap()\n",
"\n",
"The last type of plot that we will cover here is a *heat map*. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"platform_genre_sales = (\n",
" df.pivot_table(\n",
" index=\"Platform\", columns=\"Genre\", values=\"Global_Sales\", aggfunc=sum\n",
" )\n",
" .fillna(0)\n",
" .applymap(float)\n",
")\n",
"sns.heatmap(platform_genre_sales, annot=True, fmt=\".1f\", linewidths=0.5);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Plotly\n",
"\n",
"We have examined some visualization tools based on the `matplotlib` library. However, this is not the only option for plotting in `Python`. Let’s take a look at the `plotly` library. Plotly is an open-source library that allows creation of interactive plots within a Jupyter notebook without having to use Javascript.\n",
"\n",
"The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in onto a specific part of the plot, etc.\n",
"\n",
"Before we start, let’s import all the necessary modules and initialize `plotly` by calling the `init_notebook_mode()` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly\n",
"import plotly.graph_objs as go\n",
"from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot\n",
"\n",
"init_notebook_mode(connected=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Line plot\n",
"\n",
"First of all, let’s build a *line plot* showing the number of games released and their sales by year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"years_df = (\n",
" df.groupby(\"Year_of_Release\")[[\"Global_Sales\"]]\n",
" .sum()\n",
" .join(df.groupby(\"Year_of_Release\")[[\"Name\"]].count())\n",
")\n",
"years_df.columns = [\"Global_Sales\", \"Number_of_Games\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Figure` is the main class and a work horse of visualization in `plotly`. It consists of the data (an array of lines called `traces` in this library) and the style (represented by the `layout` object). In the simplest case, you may call the `iplot` function to return only `traces`.\n",
"\n",
"The `show_link` parameter toggles the visibility of the links leading to the online platform `plot.ly` in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing `show_link=False` to prevent accidental clicks on those links."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a line (trace) for the global sales\n",
"trace0 = go.Scatter(x=years_df.index, y=years_df[\"Global_Sales\"], name=\"Global Sales\")\n",
"\n",
"# Create a line (trace) for the number of games released\n",
"trace1 = go.Scatter(\n",
" x=years_df.index, y=years_df[\"Number_of_Games\"], name=\"Number of games released\"\n",
")\n",
"\n",
"# Define the data array\n",
"data = [trace0, trace1]\n",
"\n",
"# Set the title\n",
"layout = {\"title\": \"Statistics for video games\"}\n",
"\n",
"# Create a Figure and plot it\n",
"fig = go.Figure(data=data, layout=layout)\n",
"iplot(fig, show_link=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an option, you can save the plot in an html file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plotly.offline.plot(fig, filename=\"years_stats.html\", show_link=False);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bar chart\n",
"\n",
"Let's use a *bar chart* to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Do calculations and prepare the dataset\n",
"platforms_df = (\n",
" df.groupby(\"Platform\")[[\"Global_Sales\"]]\n",
" .sum()\n",
" .join(df.groupby(\"Platform\")[[\"Name\"]].count())\n",
")\n",
"platforms_df.columns = [\"Global_Sales\", \"Number_of_Games\"]\n",
"platforms_df.sort_values(\"Global_Sales\", ascending=False, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a bar for the global sales\n",
"trace0 = go.Bar(\n",
" x=platforms_df.index, y=platforms_df[\"Global_Sales\"], name=\"Global Sales\"\n",
")\n",
"\n",
"# Create a bar for the number of games released\n",
"trace1 = go.Bar(\n",
" x=platforms_df.index,\n",
" y=platforms_df[\"Number_of_Games\"],\n",
" name=\"Number of games released\",\n",
")\n",
"\n",
"# Get together the data and style objects\n",
"data = [trace0, trace1]\n",
"layout = {\"title\": \"Market share by gaming platform\"}\n",
"\n",
"# Create a `Figure` and plot it\n",
"fig = go.Figure(data=data, layout=layout)\n",
"iplot(fig, show_link=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Box plot\n",
"\n",
"`plotly` also supports *box plots*. Let’s consider the distribution of critics' ratings by the genre of the game."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = []\n",
"\n",
"# Create a box trace for each genre in our dataset\n",
"for genre in df.Genre.unique():\n",
" data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))\n",
"\n",
"# Visualize\n",
"iplot(data, show_link=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `plotly`, you can also create other types of visualization. Even with default settings, the plots look quite nice. Additionally, the library makes it easy to modify various parameters: colors, fonts, captions, annotations, and so on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Demo assignment\n",
"To practice with visual data analysis, you can complete [this assignment](https://www.kaggle.com/kashnitsky/a2-demo-analyzing-cardiovascular-data) where you'll be analyzing cardiovascular disease data.\n",
"\n",
"## 6. Useful resources\n",
"- The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-part-2-seaborn-and-plotly)\n",
"- [\"Plotly for interactive plots\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/plotly_tutorial_for_interactive_plots_sankovalev.ipynb) - a tutorial by Alexander Kovalev within mlcourse.ai (full list of tutorials is [here](https://mlcourse.ai/tutorials))\n",
"- [\"Bring your plots to life with Matplotlib animations\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/bring_your_plots_to_life_with_matplotlib_animations_kyriacos_kyriacou.ipynb) - a tutorial by Kyriacos Kyriacou within mlcourse.ai\n",
"- [\"Some details on Matplotlib\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/some_details_in_matplotlib_pisarev_ivan.ipynb) - a tutorial by Ivan Pisarev within mlcourse.ai\n",
"- Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)\n",
"- Medium [\"story\"](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd) based on this notebook\n",
"- Course materials as a [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse)\n",
"- If you read Russian: an [article](https://habrahabr.ru/company/ods/blog/323210/) on Habrahabr with ~ the same material. And a [lecture](https://youtu.be/vm63p8Od0bM) on YouTube\n",
"- Here is the official documentation for the libraries we used: [`matplotlib`](https://matplotlib.org/contents.html), [`seaborn`](https://seaborn.pydata.org/introduction.html) and [`pandas`](https://pandas.pydata.org/pandas-docs/stable/).\n",
"- The [gallery](http://seaborn.pydata.org/examples/index.html) of sample charts created with `seaborn` is a very good resource.\n",
"- Also, see the [documentation](http://scikit-learn.org/stable/modules/manifold.html) on Manifold Learning in `scikit-learn`.\n",
"- Efficient t-SNE implementation [Multicore-TSNE](https://github.com/DmitryUlyanov/Multicore-TSNE).\n",
"- \"How to Use t-SNE Effectively\", [Distill.pub](https://distill.pub/2016/misread-tsne/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}