{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregating and joining data\n", "\n", "This is the second introductory tutorial to Ibis. If you are new to Ibis, you may want to start at [the beginning of this tutorial](./01-Introduction-to-Ibis.ipynb).\n", "\n", "In the first notebook we saw how to load and query data using `ibis`. In this notebook we'll continue with the same dataset, building up some more complicated queries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tutorial_utils import setup\n", "\n", "setup()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ibis\n", "\n", "ibis.options.interactive = True\n", "\n", "connection = ibis.sqlite.connect(\"geography.db\")\n", "countries = connection.table(\"countries\")\n", "\n", "countries[\"name\", \"continent\", \"area_km2\", \"population\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Expressions\n", "\n", "We will continue by exploring the data by continent. We will start by creating an expression\n", "with the continent names, since our table only contains the abbreviations.\n", "\n", "An expression is one or more operations performed over the data. They can be used to retrieve the\n", "data or to build more complex operations.\n", "\n", "In this case we will use a `case` conditional statement to replace values depending on a condition.\n", "A `case` expression will return a case builder, and must be followed by one or more `when` calls,\n", "optionally an `else_` call, and must end with a call to `end`, to complete the full expression.\n", "The expression where `case` is called (`countries['continent']` in this case)\n", "is evaluated to see if it's equal to any of the first arguments of the calls to `when`. And the second\n", "argument is returned. If the value does not match any of the `when` values, the value of `else_` is returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "continent_name = (\n", " countries[\"continent\"]\n", " .case()\n", " .when(\"NA\", \"North America\")\n", " .when(\"SA\", \"South America\")\n", " .when(\"EU\", \"Europe\")\n", " .when(\"AF\", \"Africa\")\n", " .when(\"AS\", \"Asia\")\n", " .when(\"OC\", \"Oceania\")\n", " .when(\"AN\", \"Antarctica\")\n", " .else_(\"Unknown continent\")\n", " .end()\n", " .name(\"continent_name\")\n", ")\n", "continent_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we did is take the values of the column `countries['continent']`, and we created a calculated\n", "column with the names of the continents, as defined in the `when` methods.\n", "\n", "This calculated column is an expression. The computations didn't happen when defining the `continent_name`\n", "variable, and the results are not stored. They have been computed when we printed its content.\n", "\n", "We can see that by checking the type of `continent_name`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(continent_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next tutorial we will see more about eager and lazy mode, and when operations are being\n", "executed. For now we can think that the query to the database happens only when we want to see\n", "the results.\n", "\n", "The important part is that now we can use our `continent_name` expression in other expressions.\n", "For example, since this is a column (a `StringColumn` to be specific), we can use it as a column\n", "to query the countries table.\n", "\n", "Note that when we created the expression we added `.name('continent_name')` to it, so the column\n", "has a name when being returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries[\"name\", continent_name, \"area_km2\", \"population\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for illustration, let's repeat the same query, but renaming the expression to `continent`\n", "when using it in the list of columns to fetch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries[\"name\", continent_name.name(\"continent\"), \"area_km2\", \"population\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating data\n", "\n", "Now, let's group our data by continent, and let's find the total population of each." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.group_by(continent_name).aggregate(\n", " countries[\"population\"].sum().name(\"total_population\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see how Asia is the most populated country, followed by Africa. Antarctica is the least populated,\n", "as we would expect.\n", "\n", "The code to aggregate has two main parts:\n", "- The `group_by` method, that receive the column, expression or list of them to group by\n", "- The `aggregate` method, that receives an expression with the reduction we want to apply\n", "\n", "To make things a bit clearer, let's first save the reduction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_population = countries[\"population\"].sum().name(\"total_population\")\n", "total_population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, if we perform the operation directly, we will get the sum of the total in the column.\n", "\n", "But if we take the `total_population` expression as the parameter of the `aggregate` method, then the total is computed\n", "over every group defined by the `group_by` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.group_by(continent_name).aggregate(total_population)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to compute two aggregates at the same time, we can pass a list to the `aggregate` method.\n", "\n", "For illustration, we use the `continent` column, instead of the `continent_names` expression. We can\n", "use both column names and expressions, and also a list with any of them (e.g. `[continent_names, 'name']`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.group_by(\"continent\").aggregate(\n", " [total_population, countries[\"area_km2\"].mean().name(\"average_area\")]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining data\n", "\n", "Now we are going to get the total gross domestic product (GDP) for each continent. In this case, the GDP data\n", "is not in the same table `countries`, but in a table `gdp`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp = connection.table(\"gdp\")\n", "gdp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table contains information for different years, we can easily check the range with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp[\"year\"].min(), gdp[\"year\"].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are going to join this data with the `countries` table so we can obtain the continent\n", "of each country. The `countries` table has several different codes for the countries. Let's find out which\n", "one matches the three letter code in the `gdp` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries[\"iso_alpha2\", \"iso_alpha3\", \"iso_numeric\", \"fips\", \"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `country_code` in `gdp` corresponds to `iso_alpha3` in the `countries` table. We can also see\n", "how the `gdp` table has `10,000` rows, while `countries` has `252`. We will start joining the\n", "two tables by the codes that match, discarding the codes that do not exist in both tables.\n", "This is called an inner join." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries_and_gdp = countries.inner_join(\n", " gdp, predicates=countries[\"iso_alpha3\"] == gdp[\"country_code\"]\n", ")\n", "countries_and_gdp[countries, gdp]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We joined the table with the information for all years. Now we are going to just take the information about the last available year, 2017." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdp_2017 = gdp.filter(gdp[\"year\"] == 2017)\n", "gdp_2017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining with the new expression we get:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries_and_gdp = countries.inner_join(\n", " gdp_2017, predicates=countries[\"iso_alpha3\"] == gdp_2017[\"country_code\"]\n", ")\n", "countries_and_gdp[countries, gdp_2017]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have called the `inner_join` method of the `countries` table and passed\n", "the `gdp` table as a parameter. The method receives a second parameter, `predicates`, that is used to specify\n", "how the join will be performed. In this case we want the `iso_alpha3` column in `countries` to\n", "match the `country_code` column in `gdp`. This is specified with the expression\n", "`countries['iso_alpha3'] == gdp['country_code']`.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }