{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Dataframe Index in Lux" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import lux" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Collecting basic usage statistics for Lux (For more information, see: https://tinyurl.com/logging-consent)\n", "lux.logger = True # Remove this line if you do not want your interactions recorded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An index is a convenient way for users to access and manipulate the values of a dataframe. Often, indices are created as part of dataframe operations, such as `groupby`, `pivot`, or `crosstab`.\n", "\n", "Lux allows users to visualize dataframe with indices. In pandas, there are two types of indices: \n", "\n", " 1. column index (accessed via `.columns` property of the dataframe), and \n", " 2. row index (accessed via `.index` property of the dataframe). \n", " \n", "A dataframe index can be thought of as an extra row/column that indicates the values that the user is interested in. Lux focuses on visualizing named dataframe indices, i.e., indices with a non-null `name` property, as a proxy of the attribute that the user is interested in or have operated on (e.g., group-by attribute). In particular, dataframes with *named* indices are often pre-aggregated, so Lux visualizes *exactly* the values that the dataframe portrays.\n", "\n", "For every *named* row/column indexes in Pandas, Lux displays the corresponding types of actions: \n", "\n", " 1. **Column Groups:** shows all possible visualizations generated by groups of column-wise values \n", " \n", " \n", " 2. **Row Groups:** shows all possible visualizations generated by groups of row-wise values \n", " \n", " \n", "\n", "We will illustrate these two different actions in this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column Group Example: Group-by aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we group by `FundingModel` and compute the average value of the numeric attribute. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../data/college.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Column Groups action in Lux shows all possible bar chart distributions based on these pre-computed average values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(\"FundingModel\").mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, since the column index is unnamed, with the columns simply being a subset of the attributes in the original dataframe, Lux does not show the visualization based on each row (Row Group action)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Row Group Example: Time Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are various common ways to represent time series data via a dataframe. One of the most common ways is a \"long\" table representation where each observation is stored independently, one for each row, as shown in the example below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://github.com/lux-org/lux-datasets/blob/master/data/state_timeseries.csv?raw=true'\n", "df = pd.read_csv(url)\n", "df[\"Date\"] = pd.to_datetime(df[\"Date\"])\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To re-arrange this dataframe into time series where each `State` represents a line chart, we perform the `pivot` operation, alongside some interpolation to fill in the missing values. This gives us a \"wide\" table representation where the column index is each `Date`, and the row index is each `State`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tseries = df.pivot(index=\"State\",columns=\"Date\",values=\"Value\")\n", "# Interpolating missing values\n", "tseries[tseries.columns.min()] = tseries[tseries.columns.min()].fillna(0)\n", "tseries[tseries.columns.max()] = tseries[tseries.columns.max()].fillna(tseries.max(axis=1))\n", "tseries = tseries.interpolate('zero',axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since both the row and column index is named, Lux displays both Row and Column Group actions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (\"Column Index Name:\", tseries.columns.name)\n", "print (\"Row Index Name:\", tseries.index.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lux displays the Row Groups action where each row is plotted as a time series. This allows us to take a look at the time series, from Alabama to Wyoming. Note that in both the Row and Column Groups, we do not perform sorting of the visualizations in order to preserve their one-to-one correspondence with what is displayed in the dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "tseries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column Group action is less interesting in this case, nevertheless, we can still see that starting from Febrary, Oregon had the leading values compared to other states. As we move towards March and April, California and Florida becomes one of the states with the highest values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Complementary Perspectives of Row and Column Groups: Cross-Tabulation of Two Factors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, Row and Column Groups can present complementary perspectives on the dataset. This is especially true for dataframes resulting from [`pandas.cross-tab`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example from http://www.datasciencemadesimple.com/cross-tab-cross-table-python-pandas/\n", "d = {\n", " 'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',\n", " 'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],\n", " 'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',\n", " 'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],\n", " \n", " 'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',\n", " 'Mathematics','Mathematics','Mathematics','Science','Science','Science'],\n", " 'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}\n", " \n", "df = pd.DataFrame(d,columns=['Name','Exam','Subject','Result'])\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cross-tabulation sums up the number of occurrences for a particular outcome. For example, two students failed the exam in Semester 1, and four passed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Row Group shows visualizations comparing the Result of Fail/Pass across Semesters. The first visualization indicates that more students passed than failed in Semester 1, whereas in Semester 2, the Fail/Pass rate is equal amongst the students.\n", "The Column Group shows visualizations comparing the exam results from different Semesters across different results. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.crosstab([df.Exam],df.Result)\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.crosstab([df.Subject],df.Result)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Lux does not currently support dataframes with multiple index levels, as a result, it does not support crosstabs with more than 2 factors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.crosstab([df.Subject,df.Exam],df.Result)\n", "result" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }