{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 04 Grabbing HTML tables with Pandas\n", "What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions. \n", "\n", "_Can Python help us download those data?_ **Why yes. Yes it can.**\n", "\n", "Specifically, we use the Pandas' `read_html` function, which is able to identify tables in an HTML page and pull them out into a dataframe object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import pandas\n", "import pandas\n", "\n", "#Install lxml if not present\n", "try:\n", " import lxml\n", "except:\n", " !pip install lxml\n", " print(\"Please restart the kernel\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Here, the read_html function pulls into a list object any table in the URL we provide.\n", "tableList = pandas.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions')\n", "print (\"{} tables were found\".format(len(tableList)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let's grab the 2st table one and display it's first five rows\n", "df = tableList[1]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we see here is a table with two header rows (and Pandas was smart enough to recognize that). This is a multi-index table - a powerful, but confusing format. Basically, it allows us to means to select columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Show the hierarchical column names\n", "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Select all rows and just the columns under the \"Fossil CO2 emissions(Mt CO2)\" heading\n", "df_fossil = df.loc[:,'Fossil CO2 emissions(Mt CO2)']\n", "df_fossil.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let's add the country to our subset\n", "df_fossil = df.loc[:,['Country[20]','Fossil CO2 emissions(Mt CO2)']]\n", "df_fossil.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#We can drop the top level column indices\n", "df_fossil = df_fossil.droplevel(level=0,axis=1)\n", "df_fossil.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Tidy up the column names\n", "df_fossil.columns = ['Country','1990_emmissions','2005_emmissions','2017_emmissions']\n", "#Set the country as the index\n", "df_fossil.set_index('Country',inplace=True)\n", "df_fossil.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Now we can save it to a local file using df.to_csv()\n", "df_fossil.to_csv(\"CarbonEmissions.csv\", # The output filename\n", " index=True, # We opt not to write out the index\n", " encoding='utf8') # This deals with issues surrounding countries with odd characters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#...or we can examine it\n", "#Here is as quick preview of pandas' plotting capability\n", "df_fossil.iloc[3:].plot.scatter(x='1990_emmissions',y='2017_emmissions',grid=True);" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" } }, "nbformat": 4, "nbformat_minor": 2 }