{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  04 Grabbing HTML tables with Pandas\n",
    "What if you saw a table you wanted on a web page? For example:https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions. Can Python help us download those data? \n",
    "\n",
    "Why yes. Yes it can.\n",
    "\n",
    "Specifically, we use the Pandas' `read_html` function, which is able to identify tables in an HTML page and pull them out into a dataframe object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import pandas\n",
    "import pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#We'll need a package called lxml; install if not already\n",
    "try:\n",
    "    import lxml\n",
    "except:\n",
    "    !pip install lxml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Here, the read_html function pulls into a list object any table in the URL we provide.\n",
    "the_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions'\n",
    "tableList = pandas.read_html(the_url)\n",
    "print (\"{} tables were found\".format(len(tableList)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let's grab the 2nd table one and display it's firt five rows\n",
    "df = tableList[1]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside, the resulting table has multiple column indices. Mutliindex dataframes are powerful, but also quite confusing. We'll simply drop the first header row using the `drop_level()` command"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Fetch just the columns associated with the top column level of \"Fossil CO2 emissions(Mt CO2)\"\n",
    "df_fixed = df.droplevel(\n",
    "    level=0,         #drops the first header row\n",
    "    axis ='columns') #tells pandas we are dealing with columns, not rows\n",
    "df_fixed.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Now we can save it to a local file using df.to_csv()\n",
    "df.to_csv(\"Carbon.csv\",    # The output filename\n",
    "          index=False,     # We opt not to write out the index\n",
    "          encoding='utf8') # This deals with issues surrounding countries with odd characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#...or we can examine it\n",
    "#Here is as quick preview of pandas' plotting capability\n",
    "%matplotlib inline\n",
    "df_fixed.iloc[3:,].plot.scatter(x='1990',y='2017');"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}