{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysing the Palmer Penguins 🐧 dataset\n", "\n", "The notebook can contains the code for the accompanying blogpost titled **[Intelligent Visual Data Discovery with Lux — A Python library](https://towardsdatascience.com/intelligent-visual-data-discovery-with-lux-a-python-library-dc36a5742b2f?sk=676ebc78e32561a50f93770e487a05cd)** by [Parul Pandey](https://www.linkedin.com/in/parulpandeyindia/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://miro.medium.com/max/576/1*KU-V8tWWQU3nDtw12-bQ_g.png) \n", "*Artwork by @allison_horst*\n", "\n", "The palmer penguins dataset has currently become a favorite of the data science community and is a drop-in replacement for the overused Iris dataset. The dataset consists of data for 344 penguins. The data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/). The data can be downloaded from [here](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data?select=penguins_size.csv). Let’s start by installing and importing the Lux library.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Uncomment the code to install the library and desired extensions\n", "\n", "#! pip install lux-api\n", "\n", "#Activating extension for Jupyter notebook\n", "#! jupyter nbextension install --py luxwidget\n", "#! jupyter nbextension enable --py luxwidget\n", "\n", "#Activating extension for Jupyter lab\n", "#! jupyter labextension install @jupyter-widgets/jupyterlab-manager\n", "#! jupyter labextension install luxwidget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details like using Lux with SQL engine, read the documentation, which is pretty robust and contains many hands-on examples.\n", "\n", "## Importing the necessary libraries and the dataset\n", "Once the Lux library has been installed, we’ll import it along with our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import lux" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/penguins.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lux's nice thing is that it can be used as it is with the pandas dataframe and doesn’t require any modifications to the existing syntax. For instance, if you drop any column or row, the recommendations are regenerated based on the updated dataframe. All the nice functionalities that we get from pandas like dropping columns, importing CSVs, etc., are also preserved. Let’s get an overview of the data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some missing values in the dataset. Let’s get rid of those." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our data is now in memory, and we are all set to see how Lux can ease the EDA process for us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EDA with Lux: Supporting a Visual dataframe workflow\n", "\n", "When we print out the data frame, we see the default pandas table display. We can toggle it to get a set of recommendations generated automatically by Lux." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The recommendations in lux are organized by three different tabs, which represent potential next steps that users can take in their exploration. From the visualisations we infer that there are three different species of penguins — Adelie, Chinstrap, and Gentoo. There are also three different islands — Torgersen, Biscoe, and Dream; and both male and female species have been included in the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intent-based recommendations\n", "\n", "Beyond the basic recommendations, we can also specify our analysis intent. Let's say that we want to find out how the culmen length varies with the species. We can set the intent here as `[‘culmen_length_mm’,’species’].`When we print out the data frame again, we can see that the recommendations are steered to what is relevant to the intent that we’ve specified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.intent = ['culmen_length_mm','species']\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the left-hand side in the image below, what we see is `Current Visualization` corresponding to the attributes that we have selected. On the right-hand side, we have `Enhance` i.e. what happens when we add an attribute to the current selection. We also have the `Filter` tab which adds filter while fixing the selected variable.\n", "\n", "If you closely look at the correlations within species, culmen length and depth are positively correlated. This is a classic example of [**Simpson’s paradox**](https://en.wikipedia.org/wiki/Simpson%27s_paradox).\n", "\n", "![](https://miro.medium.com/max/254/1*bN1pTPMGUB8g7EpQurQbsQ.png)\n", "\n", "Finally, you can get a pretty clear separation between all three species by looking at flipper length versus culmen length.\n", "\n", "\n", "\n", "![Image for post](https://miro.medium.com/max/258/1*1VeJ6DCycXM67Eg6l41vdg.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting visualizations from Widget\n", "\n", "Lux also makes it pretty easy to export and share the generated visualizations. The visualizations can be exported into a static HTML as follows:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.save_as_html('file.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also access the set of [recommendations generated for the](https://lux-api.readthedocs.io/en/latest/source/guide/export.html) data frames via the properties recommendation. The output is a dictionary, keyed by the name of the recommendation category." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.recommendation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Visualizations as Code\n", "\n", "Not only can we export visualization as HTML but also as code. The visualizations can then be exported to code in [Altair](https://altair-viz.github.io/) for further edits or as [Vega-Lite](https://vega.github.io/vega-lite/) specification. More details can be found in the [documentation](https://lux-api.readthedocs.io/en/latest/source/guide/export.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vis = df.recommendation[\"Enhance\"][1]\n", "vis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(vis.to_altair())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(vis.to_vegalite())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "For more support and resources on Lux:\n", "\n", "- Sign up for the early-user [mailing list](https://forms.gle/XKv3ejrshkCi3FJE6) to stay tuned for upcoming releases, updates, or user studies.\n", "- Visit [ReadTheDoc](https://lux-api.readthedocs.io/en/latest/) for more detailed documentation.\n", "- Clone [lux-binder](https://github.com/lux-org/lux-binder) to try out these [hands-on exercises](https://github.com/lux-org/lux-binder/tree/master/exercise) or [tutorial series](https://github.com/lux-org/lux-binder/tree/master/tutorial) on how to use Lux.\n", "- Join our community [Slack](https://lux-project.slack.com/join/shared_invite/zt-iwg84wfb-fBPaGTBBZfkb9arziy3W~g) to discuss and ask questions.\n", "- Report any bugs, issues, or requests through [Github Issues](https://github.com/lux-org/lux/issues).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }