{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 04 Grabbing HTML tables with Pandas\n", "What if you saw a table you wanted on a web page? For example:https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions. Can Python help us download those data? \n", "\n", "Why yes. Yes it can.\n", "\n", "Specifically, we use the Pandas' `read_html` function, which is able to identify tables in an HTML page and pull them out into a dataframe object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import pandas\n", "import pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#We'll need a package called lxml; install if not already\n", "try:\n", " import lxml\n", "except:\n", " !pip install lxml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Here, the read_html function pulls into a list object any table in the URL we provide.\n", "the_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions'\n", "tableList = pandas.read_html(the_url)\n", "print (\"{} tables were found\".format(len(tableList)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let's grab the 2nd table one and display it's firt five rows\n", "df = tableList[1]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, the resulting table has multiple column indices. Mutliindex dataframes are powerful, but also quite confusing. We'll simply drop the first header row using the `drop_level()` command" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Fetch just the columns associated with the top column level of \"Fossil CO2 emissions(Mt CO2)\"\n", "df_fixed = df.droplevel(\n", " level=0, #drops the first header row\n", " axis ='columns') #tells pandas we are dealing with columns, not rows\n", "df_fixed.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Now we can save it to a local file using df.to_csv()\n", "df.to_csv(\"Carbon.csv\", # The output filename\n", " index=False, # We opt not to write out the index\n", " encoding='utf8') # This deals with issues surrounding countries with odd characters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#...or we can examine it\n", "#Here is as quick preview of pandas' plotting capability\n", "%matplotlib inline\n", "df_fixed.iloc[3:,].plot.scatter(x='1990',y='2017');" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 2 }