{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas demo\n",
    "\n",
    "#### Henry Schreiner\n",
    "\n",
    "A few basic imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MPG dataset\n",
    "\n",
    "Let's read a CSV file from the web *directly* into Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Info about the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Detailed memory usage:\n",
    "<!--\n",
    "data.memory_usage(deep=True)\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can ask for statistical information per column or often per dataframe:\n",
    "\n",
    "<!--\n",
    "data.mean()\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plots are easy; try `.plot`, `.scatter`, or `.hist`:\n",
    "\n",
    "<!--\n",
    "data.mpg.plot();\n",
    "data.plot.scatter('weight', 'horsepower');\n",
    "data.mpg.hist(bins=30);\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can select subsets, such as `mpg > 42`:\n",
    "\n",
    "<!--\n",
    "data[data.mpg > 42]\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you can use `groupby('cylinders')` to work on per-cylinder groups: \n",
    "\n",
    "<!--\n",
    "data.groupby('cylinders').mpg.hist();\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want legends, it's no longer one line, but still simple:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name, grp in data.groupby('cylinders').mpg:\n",
    "    grp.hist(label=name)\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The category type is better for data.origin, and saves memory too!\n",
    "\n",
    "<!--\n",
    "data.origin.memory_usage(deep=True)\n",
    "data.origin.astype('category').memory_usage(deep=True)\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.origin.astype('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can select using operators or isin:\n",
    "\n",
    "<!--\n",
    "data[~data.origin.isin(['usa', 'japan'])]\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's convert the name into make and model:\n",
    "\n",
    "<!--\n",
    "makemodel = cars.name.str.split(n=1, expand=True)\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cars = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv',\n",
    "                  dtype={\"origin\":\"category\"})\n",
    "\n",
    "...\n",
    "# cars[\"make\"], cars[\"model\"] = makemodel[0].astype('category'), makemodel[1]\n",
    "# del cars[\"name\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can put make and model together again:\n",
    "\n",
    "<!--\n",
    "cars.make.str.upper() + \" \" + cars.model.str.title()\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Math: To convert mpg to liters per 100 kilometers:\n",
    "\n",
    "$$\n",
    "lp100km = \\frac{1}{mpg} \\cdot 62.1371 \\frac{\\mathrm{miles}}{\\mathrm{100 km}} \\cdot 3.78541 \\frac{\\mathrm{liter}}{\\mathrm{gallon}}\n",
    "$$\n",
    "\n",
    "<!--\n",
    "1 / cars.mpg * 62.1371 * 3.78541\n",
    "cars.mpg.apply(lambda x: 1/x * 62.1371 * 3.78541)\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also see:\n",
    "\n",
    "\n",
    "* 10 minutes to Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html\n",
    "* 100 tips: https://www.dataschool.io/python-pandas-tips-and-tricks\n",
    "* My blog: https://iscinumpy.gitlab.io"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not covered above:\n",
    "* Fantastic date/time support, including holidays\n",
    "* Resampling, interpolation\n",
    "* Multi-indexing\n",
    "* Support for many input formats, such as HTML, the clipbord, Excel files, and more"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}