{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas demo\n", "\n", "#### Henry Schreiner\n", "\n", "A few basic imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MPG dataset\n", "\n", "Let's read a CSV file from the web *directly* into Pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Info about the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Detailed memory usage:\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can ask for statistical information per column or often per dataframe:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plots are easy; try `.plot`, `.scatter`, or `.hist`:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select subsets, such as `mpg > 42`:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use `groupby('cylinders')` to work on per-cylinder groups: \n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want legends, it's no longer one line, but still simple:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for name, grp in data.groupby('cylinders').mpg:\n", " grp.hist(label=name)\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The category type is better for data.origin, and saves memory too!\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.origin.astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select using operators or isin:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's convert the name into make and model:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cars = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv',\n", " dtype={\"origin\":\"category\"})\n", "\n", "...\n", "# cars[\"make\"], cars[\"model\"] = makemodel[0].astype('category'), makemodel[1]\n", "# del cars[\"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put make and model together again:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Math: To convert mpg to liters per 100 kilometers:\n", "\n", "$$\n", "lp100km = \\frac{1}{mpg} \\cdot 62.1371 \\frac{\\mathrm{miles}}{\\mathrm{100 km}} \\cdot 3.78541 \\frac{\\mathrm{liter}}{\\mathrm{gallon}}\n", "$$\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also see:\n", "\n", "\n", "* 10 minutes to Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html\n", "* 100 tips: https://www.dataschool.io/python-pandas-tips-and-tricks\n", "* My blog: https://iscinumpy.gitlab.io" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not covered above:\n", "* Fantastic date/time support, including holidays\n", "* Resampling, interpolation\n", "* Multi-indexing\n", "* Support for many input formats, such as HTML, the clipbord, Excel files, and more" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }