{ "cells": [ { "cell_type": "markdown", "id": "ca4dc9b7", "metadata": {}, "source": [ "# Global food trade data\n", "\n", "!!! info \"\"\n", " :octicons-person-16: **[Pablo Rosado](https://ourworldindata.org/team/pablo-rosado)** • :octicons-calendar-16: June 18, 2026 *(last edit)* • [**:octicons-mail-16: Feedback**](mailto:info@ourworldindata.org?subject=Feedback%20on%20technical%20publication%20-%20Analysing%20global%20food%20trade%20data)" ] }, { "cell_type": "markdown", "id": "ee1ed15b", "metadata": {}, "source": [ "\n", "## Introduction\n", "\n", "This is a technical companion to our article [\"_How does food get traded around the world?_\"](https://ourworldindata.org/how-does-food-get-traded-around-the-world). It examines [FAOSTAT's Detailed Trade Matrix (TM)](https://www.fao.org/faostat/en/#data/TM), and explains the methodology and choices behind our visualizations.\n", "\n", "### Content\n", "The rest of this publication is structured as follows:\n", "\n", "- **Definitions**: some clarifications on the terminology around food trade data.\n", "- **Set up**: initial steps to be able to run this file, and some basic processing.\n", "- **Coverage:** how complete the bilateral trade data is, across years, countries, and commodities.\n", "- **Reporting agreement:** why the exporting and importing countries often report different quantities for the same shipment, and how we tackle that.\n", "- **Estimating production and supply:** how we estimate an exporter's production and an importer's domestic supply, by combining two datasets.\n", "- **Combining items:** how we group certain food items that are different variants of the same product (e.g. nuts with and without shell).\n" ] }, { "cell_type": "markdown", "id": "ec1ce8de", "metadata": {}, "source": [ "## Definitions\n", "The TM dataset contains _item flows_.\n", "An item is a food product, either a primary commodity such as wheat or bananas, or a processed product such as meat (fish products are not included in the dataset).\n", "A flow is a trade exchange between two countries, namely an import or an export.\n", "\n", "Each flow is measured as a quantity (in tonnes, or number of animals) and as a monetary value (in dollars).\n", "Here we will analyse quantities of traded food (in tonnes).\n", "\n", "Each entry in the data corresponds to a trade flow reported by a certain country and year, with a given partner country.\n", "For example, in one entry, Brazil reports exporting ~4 million tonnes of soya beans to Argentina in 2023.\n", "\n", "In some edge cases, the reporter and partner countries coincide; we remove those cases." ] }, { "cell_type": "markdown", "id": "d9cdbd74", "metadata": {}, "source": [ "## Set up\n", "\n", "In this section I'll guide you through a few steps you'd need to be able to run the code in this file, and replicate its results.\n", "If you are only interested in reading the outcomes, feel free to skip this section.\n", "\n", "### Required libraries\n", "\n", "Whether you run this code locally or on Google Collab, there are some libraries you may need to install." ] }, { "cell_type": "code", "execution_count": null, "id": "2ad05630", "metadata": { "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# Whether you run this code in your local computer or on Google Colab, you first need to install the required dependencies:\n", "# * owid-catalog: a library that lets you explore and load datasets from OWID.\n", "# * plotly: a library that lets you create interactive visualizations.\n", "%pip install --upgrade owid-catalog plotly > /dev/null 2>&1" ] }, { "cell_type": "markdown", "id": "dc10fa7f", "metadata": {}, "source": [ "Import necessary libraries and define some common variables." ] }, { "cell_type": "code", "execution_count": null, "id": "dd172ee7", "metadata": {}, "outputs": [], "source": [ "import contextlib\n", "import io\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "from owid.catalog import fetch\n", "\n", "\n", "def load_owid_table(catalog_path):\n", " \"\"\"Load a table from OWID's public catalog, so this notebook runs anywhere (e.g. Colab).\n", "\n", " `catalog_path` is \"channel/namespace/version/dataset/table\". Until these datasets are\n", " published to the public catalog, this falls back to the local ETL data directory.\n", " \"\"\"\n", " # The catalog's loading spinner overwrites one terminal line with \"\\r\"; under nbconvert each\n", " # frame is captured as its own output line, so silence stdout while loading.\n", " with contextlib.redirect_stdout(io.StringIO()):\n", " try:\n", " tb = fetch(catalog_path)\n", " return tb.reset_index() if list(tb.index.names) != [None] else tb\n", " except Exception:\n", " from owid.catalog import Dataset\n", " from etl.paths import DATA_DIR\n", "\n", " channel, namespace, version, dataset, table = catalog_path.split(\"/\")\n", " return Dataset(DATA_DIR / channel / namespace / version / dataset).read(table, safe_types=False)\n", "\n", "\n", "tb = load_owid_table(\"garden/faostat/2026-05-07/faostat_tm/faostat_tm\")\n", "tb_scl = load_owid_table(\"garden/faostat/2026-02-25/faostat_scl/faostat_scl\")\n", "tb_food_trade = load_owid_table(\"garden/faostat/2026-05-07/food_trade/food_trade\")\n", "\n", "# A few curated items combine several FAO codes (a primary commodity plus a mechanically-derived\n", "# form whose trade FAO reports separately). The food_trade step sums them; this mirrors the\n", "# multi-code entries in food_trade.items.yaml. Maps display name -> {FAO code: short form label}.\n", "COMBINED_ITEMS = {\n", " \"Beef\": {867: \"bone-in\", 870: \"boneless\"},\n", " \"Pork\": {1035: \"bone-in\", 1038: \"boneless\"},\n", " \"Almonds\": {221: \"in-shell\", 231: \"shelled\"},\n", " \"Cashews\": {217: \"in-shell\", 230: \"shelled\"},\n", " \"Walnuts\": {222: \"in-shell\", 232: \"shelled\"},\n", " \"Hazelnuts\": {225: \"in-shell\", 233: \"shelled\"},\n", " \"Brazil nuts\": {216: \"in-shell\", 229: \"shelled\"},\n", " \"Groundnuts\": {242: \"in-shell\", 243: \"shelled\"},\n", " \"Rice\": {31: \"milled\", 32: \"broken\", 28: \"husked\"},\n", " \"Sugar\": {162: \"raw\", 164: \"refined\"},\n", "}\n", "\n", "# Map each curated FAO code to its display name. Combined items appear in the published table under\n", "# a synthetic id (100000 + first code), so we expand them back to their underlying FAO codes here,\n", "# letting analyses that join on the trade matrix's own codes still cover every curated item.\n", "curated_code_to_display = {}\n", "for code, item in dict(zip(tb_food_trade[\"item_code\"].astype(int), tb_food_trade[\"item\"].astype(str))).items():\n", " if item in COMBINED_ITEMS:\n", " for member in COMBINED_ITEMS[item]:\n", " curated_code_to_display[member] = item\n", " else:\n", " curated_code_to_display[code] = item" ] }, { "cell_type": "markdown", "id": "a89842f5", "metadata": {}, "source": [ "### Basic data preparation\n", "\n", "The TM dataset has ~52 million entries.\n", "\n", "We select only rows of item flow quantities (in tonnes); then the dataset contains ~26 million entries.\n", "\n", "We also drop rows of self-trade, where reporter and partner countries coincide (which are just a ~0.1% of the data).\n", "\n", "Around 12% of rows correspond to trade of zero tonnes (which we keep in the data)." ] }, { "cell_type": "code", "execution_count": null, "id": "d92eaed7", "metadata": {}, "outputs": [], "source": [ "# Total number of entries\n", "# len(tb)\n", "\n", "# Basic data preparation: restrict to quantity reports in tonnes (this drops the\n", "# alternate-unit rows used for live animals — 'An', '1000 An', 'No' — and the\n", "# value-in-USD rows),\n", "tb = tb[(tb[\"unit\"] == \"t\")].reset_index(drop=True)\n", "# len(tb) * 1e-6\n", "\n", "# Drop self-trade rows (~0.1% of the data, where reporter\n", "# and partner are the same country).\n", "tb = tb[tb[\"reporter_country\"].astype(str) != tb[\"partner_country\"].astype(str)].reset_index(drop=True)\n", "# len(tb) * 1e-6\n", "\n", "# Share of rows with zero tonnes reported.\n", "# 100 * len(tb[tb[\"value\"] == 0]) / len(tb)" ] }, { "cell_type": "markdown", "id": "0e28cf08", "metadata": {}, "source": [ "## Coverage\n", "\n", "### List of item flows included\n", "\n", "The dataset has over 500 items.\n", "To avoid clutter, for the final custom visualization we have selected a curated list of items.\n", "However, in this analysis we will consider all items (except when we explicitly mention that we use the curated list).\n", "\n", "### Data provenance flags\n", "FAOSTAT flags each quantity row with its provenance.\n", "\n", "In general, quantity (in tonnes) and value (in dollars) are independently collected in the TM dataset.\n", "\n", "About 98% of rows in the TM dataset are flagged \"Official figure\"; the remaining ~2% are flagged as imputed, estimated, or sourced from an external organization.\n", "\n", "We keep all rows regardless of flag." ] }, { "cell_type": "code", "execution_count": 4, "id": "173b913d", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:25:45.499673Z", "iopub.status.busy": "2026-06-18T15:25:45.499601Z", "iopub.status.idle": "2026-06-18T15:25:45.643298Z", "shell.execute_reply": "2026-06-18T15:25:45.642945Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# Double-check the flag-share numbers quoted above.\n", "imputed_flags = [\"Value imputed by a receiving agency\", \"Estimated value\", \"Figure from external organization\"]\n", "quantity_rows = tb[tb[\"element\"].isin([\"Import quantity\", \"Export quantity\"])]\n", "# Share of rows flagged \"Official figure\" across all years (claimed: ~98%).\n", "official_share = (quantity_rows[\"flag\"] == \"Official figure\").mean() * 100\n", "# Share of rows with a non-official flag across all years (claimed: ~2%).\n", "non_official_share = quantity_rows[\"flag\"].isin(imputed_flags).mean() * 100\n", "assert 97 < official_share < 99\n", "assert 1 < non_official_share < 3" ] }, { "cell_type": "markdown", "id": "ad0dc7fd", "metadata": {}, "source": [ "### Year coverage\n", "\n", "We would ideally show food trade figures for the most recent year available in the data.\n", "However, the latest year may be affected by incomplete reporting.\n", "\n", "The following chart helps us decide which year to choose.\n", "It shows the number of rows, as well as the number of distinct reporting countries, for each year in the dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "ef84887e", "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_coverage(tb):\n", " \"\"\"Show two bars per year: number of rows and number of distinct\n", " reporting countries. When both drop together a year is partially\n", " reported; when only rows drop it might be due to an actual trade contraction.\"\"\"\n", " grouped = tb.groupby(\"year\", observed=True)\n", " rows = grouped.size().sort_index()\n", " reporters = grouped[\"reporter_country\"].nunique().sort_index()\n", " years = rows.index.astype(int).tolist()\n", "\n", " fig = go.Figure()\n", " fig.add_bar(x=years, y=rows.values, name=\"Rows\", yaxis=\"y1\", opacity=0.8)\n", " fig.add_bar(x=years, y=reporters.values, name=\"Distinct reporters\", yaxis=\"y2\", opacity=0.8)\n", " fig.update_layout(\n", " title=\"Yearly data coverage\",\n", " xaxis=dict(title=\"Year\"),\n", " yaxis=dict(title=\"Number of rows\", side=\"left\"),\n", " yaxis2=dict(title=\"Number of distinct reporters\", side=\"right\", overlaying=\"y\", showgrid=False),\n", " barmode=\"group\",\n", " legend=dict(orientation=\"h\", yanchor=\"bottom\", y=1.02, xanchor=\"right\", x=1),\n", " )\n", " fig.show()\n", "\n", "plot_coverage(tb)" ] }, { "cell_type": "markdown", "id": "ec77804c", "metadata": {}, "source": [ "We can see that the latest year (2024) shows a significant drop in coverage with respect to the second latest.\n", "\n", "So, in our custom visualization, and in the rest of the analysis, we will focus on the latest year with a reasonably complete coverage, namely 2023." ] }, { "cell_type": "code", "execution_count": null, "id": "e14b5af3", "metadata": {}, "outputs": [], "source": [ "# Year to use for year-specific analyses from now on.\n", "YEAR = 2023" ] }, { "cell_type": "markdown", "id": "6d19b045", "metadata": {}, "source": [ "## Reporting agreement\n", "As mentioned before, Brazil reports exporting ~4 million tonnes of soya beans to Argentina in 2023.\n", "Ideally, one would expect that item flows would be reported both ways, and that both countries would agree on the traded amount.\n", "Indeed, there's an entry for 2023, where Argentina reports importing ~4 million tonnes of soya beans from Brazil.\n", "\n", "However, this is very often not the case.\n", "- Item flows are often reported only one way.\n", "- When item flows are reported both ways, the quantity often diverges.\n", "\n", "As we will see later, trade flows are sometimes reported only by the importer country, sometimes the exporter, and sometimes both.\n", "\n", "When both countries report the same flow, there are often significant disagreements in the reported quantities.\n", "These differences can arise even when both countries follow the same international guidelines, due to timing, partner country attribution, confidentiality, and other factors. FAOSTAT explicitly notes in their [Detailed Trade Matrix methodology note](https://files-faostat.fao.org/production/TM/TM_e.pdf) that \"the trade matrix data are un-reconciled\".\n", "\n", "For example, in 2023 Ukraine reports exporting ~2 million tonnes of wheat to Romania, but Romania only reports importing ~200 thousand tonnes (a factor of ~10 apart).\n", "\n", "According to FAOSTAT's [Food Balance Sheets and Supply Utilization Accounts Resource Handbook 2025](https://openknowledge.fao.org/items/30e641b1-6d57-4998-8e8a-395727aa4307), section 6.1,\n", "> \"[...] imports are typically documented more thoroughly and verified more rigorously than exports\".\n", "\n", "This claim cites the United Nations Statistics Division's [International Merchandise Trade Statistics](https://unstats.un.org/unsd/trade/EG-IMTS/IMTS2010-CM%20-%20white%20cover%20version.pdf).\n", "That document mentions, in section 3.24:\n", "> Because of the greater customs scrutiny to which imports are subjected by customs in most countries, it is usually more feasible to derive estimates of exports from counterpart imports.\n", "\n", "Then, in section 9.44:\n", "> \"Import data were, in general, considered to be of better quality than export data because imports are reported in sufficient detail to allow customs to apply duties, taxes or other regulatory controls.\"\n", "\n", "Finally, in section 15.15:\n", "> \"Also, the customs administration is generally more interested in the quantity information for imports than in that for exports, since quantity information is, in some cases, utilized to determine both import duties and the unit values used to validate the price and value information declared by the importers.\"\n", "\n", "However they also note, in section 9.44, that \n", "> [...] for certain commodities and in some countries, export data were viewed as being more accurate for the same reasons.\n", "\n", "With all this, we therefore adopt the following simple criterion:\n", "- If only one country reports the flow, we use that country's quantity.\n", "- If both sides report the flow, we use the **importer-reported** quantity.\n", "\n", "But note that there is unfortunately no simple way to reconcile reported quantities in bilateral trade flows.\n", "For broader context on why bilateral trade reports disagree, see our topic page on trade and globalization, specifically [the section on trade data discrepancies](https://ourworldindata.org/trade-and-globalization#why-doesn-t-the-data-add-up)." ] }, { "cell_type": "markdown", "id": "2e88a9eb", "metadata": {}, "source": [ "### Reporting by year\n", "\n", "For each item flow (an item going from a specific country to another) in a year, we count how many times that flow was reported both ways (\"matched\") or only one way (\"exporter-only\" or \"importer-only\").\n", "The result is shown in the following chart." ] }, { "cell_type": "code", "execution_count": 7, "id": "f6b51cab", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:25:46.775246Z", "iopub.status.busy": "2026-06-18T15:25:46.775193Z", "iopub.status.idle": "2026-06-18T15:26:08.554146Z", "shell.execute_reply": "2026-06-18T15:26:08.553689Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_reporting_coverage_by_year(tb):\n", " \"\"\"Stacked bar chart per year showing what fraction of bilateral flows\n", " are reported by both sides (matched), only by the exporter, or only by\n", " the importer.\"\"\"\n", " qty = tb[tb[\"element\"].isin([\"Export quantity\", \"Import quantity\"])][\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"year\", \"element\"]\n", " ].copy()\n", " for col in (\"reporter_country\", \"partner_country\"):\n", " qty[col] = qty[col].astype(str)\n", "\n", " exp_keys = qty.loc[\n", " qty[\"element\"] == \"Export quantity\",\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"year\"],\n", " ].drop_duplicates()\n", " imp_keys = (\n", " qty.loc[\n", " qty[\"element\"] == \"Import quantity\",\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"year\"],\n", " ]\n", " .rename(columns={\"reporter_country\": \"partner_country\", \"partner_country\": \"reporter_country\"})\n", " .drop_duplicates()\n", " )\n", "\n", " merged = exp_keys.merge(\n", " imp_keys, how=\"outer\", indicator=True,\n", " on=[\"reporter_country\", \"partner_country\", \"item_code\", \"year\"],\n", " )\n", " by_year = (\n", " merged.groupby(\"year\", observed=True)[\"_merge\"]\n", " .value_counts(normalize=True)\n", " .unstack(fill_value=0.0)\n", " .rename(columns={\"both\": \"matched\", \"left_only\": \"exporter-only\", \"right_only\": \"importer-only\"})\n", " )\n", " by_year = by_year[[\"matched\", \"exporter-only\", \"importer-only\"]].reset_index()\n", " long = by_year.melt(id_vars=\"year\", var_name=\"status\", value_name=\"share\")\n", "\n", " fig = px.bar(\n", " long, x=\"year\", y=\"share\", color=\"status\",\n", " title=\"Reporting coverage\",\n", " labels={\"year\": \"Year\", \"share\": \"Share of item flows\"},\n", " category_orders={\"status\": [\"matched\", \"exporter-only\", \"importer-only\"]},\n", " )\n", " fig.update_layout(barmode=\"stack\", yaxis_tickformat=\".0%\")\n", " fig.show()\n", "\n", "plot_reporting_coverage_by_year(tb)" ] }, { "cell_type": "markdown", "id": "ded11af3", "metadata": {}, "source": [ "We can see that the reporting inconsistency is quite severe.\n", "Most item flows are reported only one way (only the importer, or only the exporter).\n", "\n", "In 2023, only around 40% of item exchanges were reported both ways.\n", "In other words, 60% of all (exporter, importer, item) flows show up in only one side's records." ] }, { "cell_type": "code", "execution_count": 8, "id": "ef68325d", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:08.555498Z", "iopub.status.busy": "2026-06-18T15:26:08.555431Z", "iopub.status.idle": "2026-06-18T15:26:09.153792Z", "shell.execute_reply": "2026-06-18T15:26:09.153143Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# Simple code to double-check that result for 2023 alone.\n", "exporter_reports = tb[(tb[\"year\"] == YEAR) & (tb[\"element\"] == \"Export quantity\")].rename(columns={\"reporter_country\": \"exporter\", \"partner_country\": \"importer\"})\n", "importer_reports = tb[(tb[\"year\"] == YEAR) & (tb[\"element\"] == \"Import quantity\")].rename(columns={\"reporter_country\": \"importer\", \"partner_country\": \"exporter\"})\n", "exporter_reports_set = set(exporter_reports[\"exporter\"].astype(\"string\") + exporter_reports[\"importer\"].astype(\"string\") + exporter_reports[\"item\"].astype(\"string\"))\n", "importer_reports_set = set(importer_reports[\"exporter\"].astype(\"string\") + importer_reports[\"importer\"].astype(\"string\") + importer_reports[\"item\"].astype(\"string\"))\n", "# Complete set of item flows reported (by importers, exporters, or both).\n", "all_flows_set = importer_reports_set | exporter_reports_set\n", "# Share of all flows that are reported by importers only.\n", "# 100 * len(importer_reports_set - exporter_reports_set) / len(all_flows_set)\n", "\n", "# Share of all flows that are reported by exporters only.\n", "# 100 * len(exporter_reports_set - importer_reports_set) / len(all_flows_set)\n", "\n", "# Share of all flows that are reported by both importers and exporters:\n", "# 100 * len(exporter_reports_set & importer_reports_set) / len(all_flows_set)" ] }, { "cell_type": "markdown", "id": "fc3945c0", "metadata": {}, "source": [ "### Reporting by item\n", "\n", "The previous chart shows the share of matched flows averaged over all items. We can also break the same question down by item, restricting to the items in our curated list, for a chosen year.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "60235331", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:09.155317Z", "iopub.status.busy": "2026-06-18T15:26:09.155226Z", "iopub.status.idle": "2026-06-18T15:26:09.770917Z", "shell.execute_reply": "2026-06-18T15:26:09.770563Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_reporting_coverage_by_item(tb, year, top_n=None):\n", " \"\"\"Bar chart of the share of bilateral flows reported by both sides\n", " ('matched'), per item in our curated list, for the given year.\n", "\n", " Useful for spotting which items in our curated list suffer most from\n", " one-sided reporting (low matched share = most flows are reported by\n", " only one country).\"\"\"\n", " code_to_display = curated_code_to_display\n", "\n", " sub = tb[\n", " (tb[\"year\"] == year)\n", " & tb[\"item_code\"].isin(code_to_display)\n", " & tb[\"element\"].isin([\"Export quantity\", \"Import quantity\"])\n", " ]\n", " cols = [\"reporter_country\", \"partner_country\", \"item_code\"]\n", " qty = sub[cols + [\"element\"]].copy()\n", " for col in (\"reporter_country\", \"partner_country\"):\n", " qty[col] = qty[col].astype(str)\n", "\n", " exp_keys = qty.loc[qty[\"element\"] == \"Export quantity\", cols].drop_duplicates()\n", " imp_keys = (\n", " qty.loc[qty[\"element\"] == \"Import quantity\", cols]\n", " .rename(columns={\"reporter_country\": \"partner_country\", \"partner_country\": \"reporter_country\"})\n", " .drop_duplicates()\n", " )\n", " merged = exp_keys.merge(imp_keys, how=\"outer\", indicator=True, on=cols)\n", "\n", " by_item = (\n", " merged.groupby(\"item_code\", observed=True)[\"_merge\"]\n", " .value_counts(normalize=True)\n", " .unstack(fill_value=0.0)\n", " )\n", " matched = by_item[\"both\"] if \"both\" in by_item.columns else pd.Series(0.0, index=by_item.index)\n", " matched.index = matched.index.map(code_to_display)\n", " matched = matched.sort_values(ascending=True)\n", "\n", " if top_n is not None:\n", " matched = matched.head(top_n)\n", "\n", " fig = px.bar(\n", " x=matched.values,\n", " y=matched.index,\n", " orientation=\"h\",\n", " title=f\"Share of item flows reported by both sides in {year}, by item\",\n", " labels={\"x\": \"Share matched\", \"y\": \"Item\"},\n", " )\n", " fig.update_layout(xaxis_tickformat=\".0%\", height=max(400, 22 * len(matched)))\n", " fig.show()\n", "\n", "\n", "plot_reporting_coverage_by_item(tb, year=YEAR)\n" ] }, { "cell_type": "markdown", "id": "99321c09", "metadata": {}, "source": [ "### Reporting by country\n", "\n", "For each country, we can compute its reporting rate: the fraction of item flows the country is involved in (as either exporter or importer) that it itself reports to FAOSTAT.\n", "\n", "As the chart below shows, in 2023, 46 countries traded food but reported nothing at all.\n", "This means that everything we know about their food trade comes from what their partners report.\n", "Examples of those countries are Russia, Bangladesh, Algeria, Iraq, Chile, and Iran.\n", "\n", "Only one country reported more than 90% of item flows (Netherlands).\n", "\n", "And it is reasonable to expect that some important trade flows will be entirely missing from the dataset.\n", "For example, any trade between countries that report nothing will be ignored, e.g. any flow between Russia and Iran." ] }, { "cell_type": "code", "execution_count": 10, "id": "2c1ce6ae", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:09.772378Z", "iopub.status.busy": "2026-06-18T15:26:09.772299Z", "iopub.status.idle": "2026-06-18T15:26:10.787793Z", "shell.execute_reply": "2026-06-18T15:26:10.787387Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_country_reporting_rate_distribution(tb, year):\n", " \"\"\"Histogram of country reporting rates for the given year.\n", "\n", " For each country C, the reporting rate is the fraction of bilateral\n", " flows involving C (as either exporter or importer) for which C itself\n", " submitted a report to FAOSTAT. A rate near 0% means we know about C's\n", " trade only because C's partners report it.\n", " A rate near 100% means C publishes its own side of almost every flow\n", " it's involved in.\n", "\n", " The \"0%\" bar is given its own bucket so the silent group stands out\n", " from the rest of the (0, 100%] distribution.\"\"\"\n", " qty = tb[(tb[\"year\"] == year) & tb[\"element\"].isin([\"Export quantity\", \"Import quantity\"])][\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"element\"]\n", " ].copy()\n", " for col in (\"reporter_country\", \"partner_country\"):\n", " qty[col] = qty[col].astype(str)\n", "\n", " exp = qty.loc[qty[\"element\"] == \"Export quantity\", [\"reporter_country\", \"partner_country\", \"item_code\"]].drop_duplicates()\n", " exp.columns = [\"exporter\", \"importer\", \"item_code\"]\n", " exp[\"by_exp\"] = True\n", " imp = qty.loc[qty[\"element\"] == \"Import quantity\", [\"reporter_country\", \"partner_country\", \"item_code\"]].drop_duplicates()\n", " imp.columns = [\"importer\", \"exporter\", \"item_code\"]\n", " imp[\"by_imp\"] = True\n", "\n", " flows = exp.merge(imp, on=[\"exporter\", \"importer\", \"item_code\"], how=\"outer\")\n", " flows[\"by_exp\"] = flows[\"by_exp\"].notna()\n", " flows[\"by_imp\"] = flows[\"by_imp\"].notna()\n", "\n", " ev = flows.groupby(\"exporter\").agg(n_flows=(\"item_code\", \"size\"), n_self=(\"by_exp\", \"sum\")).reset_index().rename(columns={\"exporter\": \"country\"})\n", " iv = flows.groupby(\"importer\").agg(n_flows=(\"item_code\", \"size\"), n_self=(\"by_imp\", \"sum\")).reset_index().rename(columns={\"importer\": \"country\"})\n", " pc = pd.concat([ev, iv]).groupby(\"country\", as_index=False).sum()\n", " pc[\"rate\"] = pc[\"n_self\"] / pc[\"n_flows\"]\n", "\n", " # Build 11 bins: an exact-0% bucket plus ten 10%-wide buckets covering\n", " # (0, 100%]. A tiny epsilon separates the 0% bucket from \"(0, 10%]\".\n", " eps = 1e-9\n", " bin_edges = [-eps, eps, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0 + eps]\n", " bin_labels = [\"0%\", \"0-10%\", \"10-20%\", \"20-30%\", \"30-40%\", \"40-50%\",\n", " \"50-60%\", \"60-70%\", \"70-80%\", \"80-90%\", \"90-100%\"]\n", " pc[\"bin\"] = pd.cut(pc[\"rate\"], bins=bin_edges, labels=bin_labels, include_lowest=True)\n", " counts = pc.groupby(\"bin\", observed=False).size().reset_index(name=\"n_countries\")\n", "\n", " fig = px.bar(\n", " counts,\n", " x=\"bin\",\n", " y=\"n_countries\",\n", " text=\"n_countries\",\n", " title=f\"Distribution of country reporting rates in {year}\",\n", " labels={\"bin\": \"Reporting rate\", \"n_countries\": \"Number of countries\"},\n", " category_orders={\"bin\": bin_labels},\n", " )\n", " fig.update_traces(textposition=\"outside\")\n", " fig.update_layout(showlegend=False)\n", " fig.show()\n", "\n", "\n", "plot_country_reporting_rate_distribution(tb, year=YEAR)\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "30cc9ab2", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:10.789167Z", "iopub.status.busy": "2026-06-18T15:26:10.789102Z", "iopub.status.idle": "2026-06-18T15:26:10.790747Z", "shell.execute_reply": "2026-06-18T15:26:10.790367Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# List of countries involved in at least one item flow, that have zero reported flows in 2023:\n", "# len((set(tb[(tb[\"year\"] == YEAR)][\"partner_country\"]) | set(tb[(tb[\"year\"] == YEAR)][\"reporter_country\"])) - set(tb[(tb[\"year\"] == YEAR)][\"reporter_country\"]))" ] }, { "cell_type": "markdown", "id": "09bfb299", "metadata": {}, "source": [ "### Quantity agreement\n", "\n", "For item flows that were reported both ways, how closely do the two reported quantities agree?\n", "Let's define an agreement ratio as:\n", "\n", "`min(exported quantity, imported quantity) / max(exported quantity, imported quantity)`\n", "\n", "This means that 100% is a perfect match (i.e. both countries report exactly the same quantity), and 50% means one country reports twice the other.\n", "\n", "The following chart shows the agreement ratio for the top food trading countries." ] }, { "cell_type": "code", "execution_count": 12, "id": "4ae5a07a", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:10.792062Z", "iopub.status.busy": "2026-06-18T15:26:10.792003Z", "iopub.status.idle": "2026-06-18T15:26:12.618813Z", "shell.execute_reply": "2026-06-18T15:26:12.618435Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_quantity_mismatch_by_reporter(tb, year=None, top_n=20, include_unmatched=False):\n", " \"\"\"Stacked horizontal bar showing how each country's reports distribute\n", " across quantity-agreement bands, for the top-N reporting countries.\n", "\n", " A *matched* flow is one where both the exporter and the importer filed a\n", " report. Agreement = min(exp_qty, imp_qty) / max(exp_qty, imp_qty). 100%\n", " = perfect match, 50% = one side reports 2x the other. Bins into five bands.\n", "\n", " The chart aggregates *by the country that filed the report* (the FAOSTAT\n", " `reporter_country` column). Each matched flow therefore contributes two\n", " data points — one to the exporter's distribution and one to the importer's\n", " — both carrying the same agreement value.\n", "\n", " When include_unmatched=True, flows reported by only one side land in a\n", " sixth Unmatched band, attributed to the single reporting country.\"\"\"\n", " if year is None:\n", " rows_per_year = tb.groupby(\"year\", observed=True).size()\n", " year = int(rows_per_year[rows_per_year >= 0.9 * rows_per_year.max()].index.max())\n", "\n", " sub = tb[(tb[\"year\"] == year) & tb[\"element\"].isin([\"Export quantity\", \"Import quantity\"])][\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"element\", \"value\"]\n", " ].copy()\n", " for col in (\"reporter_country\", \"partner_country\"):\n", " sub[col] = sub[col].astype(str)\n", "\n", " # Canonical flow key: (exporter, importer, item_code) for every row.\n", " # Export rows: exporter = reporter, importer = partner.\n", " # Import rows: exporter = partner, importer = reporter.\n", " is_imp = (sub[\"element\"] == \"Import quantity\").to_numpy()\n", " rep = sub[\"reporter_country\"].to_numpy()\n", " par = sub[\"partner_country\"].to_numpy()\n", " exporter = np.where(is_imp, par, rep)\n", " importer = np.where(is_imp, rep, par)\n", " item_code = sub[\"item_code\"].astype(str).to_numpy()\n", " sub[\"flow_key\"] = [f\"{e}|{i}|{c}\" for e, i, c in zip(exporter, importer, item_code)]\n", "\n", " side_counts = sub.groupby(\"flow_key\").size()\n", " matched_keys = side_counts[side_counts == 2].index\n", " matched_rows = sub[sub[\"flow_key\"].isin(matched_keys)].copy()\n", "\n", " # For each matched flow, the agreement is the same for both of its rows\n", " # (since min/max is computed per flow_key).\n", " matched_rows[\"v_min\"] = matched_rows.groupby(\"flow_key\")[\"value\"].transform(\"min\")\n", " matched_rows[\"v_max\"] = matched_rows.groupby(\"flow_key\")[\"value\"].transform(\"max\")\n", " with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n", " agreement = np.where(matched_rows[\"v_max\"] > 0,\n", " matched_rows[\"v_min\"] / matched_rows[\"v_max\"], 0.0)\n", " matched_rows[\"agreement\"] = agreement\n", "\n", " matched_band_labels = [\"<25%\", \"25-50%\", \"50-75%\", \"75-90%\", \">=90%\"]\n", " matched_band_colors = [\"#d73027\", \"#fc8d59\", \"#fee08b\", \"#91cf60\", \"#1a9850\"]\n", " matched_rows[\"band\"] = pd.cut(\n", " matched_rows[\"agreement\"],\n", " bins=[-0.001, 0.25, 0.50, 0.75, 0.90, 1.001],\n", " labels=matched_band_labels,\n", " )\n", "\n", " if include_unmatched:\n", " unmatched_keys = side_counts[side_counts == 1].index\n", " unmatched_rows = sub[sub[\"flow_key\"].isin(unmatched_keys)].copy()\n", " unmatched_rows[\"band\"] = \"Unmatched\"\n", " all_rows = pd.concat([\n", " matched_rows[[\"reporter_country\", \"band\", \"value\"]],\n", " unmatched_rows[[\"reporter_country\", \"band\", \"value\"]],\n", " ], ignore_index=True)\n", " band_labels = [\"Unmatched\"] + matched_band_labels\n", " band_colors = [\"#67000d\"] + matched_band_colors\n", " else:\n", " all_rows = matched_rows[[\"reporter_country\", \"band\", \"value\"]].copy()\n", " band_labels = matched_band_labels\n", " band_colors = matched_band_colors\n", " all_rows[\"band\"] = pd.Categorical(all_rows[\"band\"], categories=band_labels)\n", "\n", " # Rank reporters by total tonnes they reported this year (matched + unmatched-by-them rows).\n", " # This matches the natural intuition of \"biggest traders\" better than report count.\n", " reporter_tonnes = all_rows.groupby(\"reporter_country\", observed=True)[\"value\"].sum().sort_values(ascending=False) # ty: ignore[missing-argument]\n", " top = reporter_tonnes.head(top_n).index.tolist() if top_n is not None else reporter_tonnes.index.tolist()\n", " sub_flat = all_rows[all_rows[\"reporter_country\"].isin(top)]\n", " shares = sub_flat.groupby([\"reporter_country\", \"band\"], observed=True).size().unstack(fill_value=0)\n", " shares = shares.reindex(columns=band_labels, fill_value=0)\n", " shares = shares.div(shares.sum(axis=1), axis=0)\n", " shares = shares.sort_values(\">=90%\", ascending=True)\n", "\n", " long = shares.reset_index().melt(id_vars=\"reporter_country\", var_name=\"band\", value_name=\"share\")\n", " title = f\"Quantity agreement among matched flows in {year}\"\n", " x_label = (\"Share of all reports filed by the country\" if include_unmatched\n", " else \"Share of matched-flow reports filed by the country\")\n", " fig = px.bar(\n", " long, x=\"share\", y=\"reporter_country\", color=\"band\", orientation=\"h\",\n", " title=title,\n", " labels={\"share\": x_label, \"reporter_country\": \"Reporter\"},\n", " category_orders={\"band\": band_labels, \"reporter_country\": shares.index.tolist()},\n", " color_discrete_sequence=band_colors,\n", " )\n", " fig.update_layout(\n", " barmode=\"stack\",\n", " xaxis_tickformat=\".0%\",\n", " height=max(400, 18 * len(shares)),\n", " yaxis=dict(tickmode=\"linear\", dtick=1),\n", " )\n", " fig.show()\n", "\n", "\n", "plot_quantity_mismatch_by_reporter(tb, year=2023, top_n=30, include_unmatched=False)" ] }, { "cell_type": "markdown", "id": "961d7791", "metadata": {}, "source": [ "For Brazil, for example, we see that only around 43% of those matched flows agreed with each other well (the reported quantity matched between 75 and 100%).\n", "\n", "This disagreement is surprisingly large and common." ] }, { "cell_type": "code", "execution_count": 13, "id": "40d99b79", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:12.620214Z", "iopub.status.busy": "2026-06-18T15:26:12.620151Z", "iopub.status.idle": "2026-06-18T15:26:14.120279Z", "shell.execute_reply": "2026-06-18T15:26:14.119757Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# Simple code to double-check that result for 2023 alone.\n", "# Define a unique identifier of each item flow, namely [importer country][exporter country][item]\n", "t = tb[(tb[\"year\"]==YEAR)].reset_index(drop=True)\n", "t[\"id\"] = \"\"\n", "mask_imports = (t[\"element\"] == \"Import quantity\")\n", "t.loc[mask_imports, \"id\"] = t[mask_imports][\"reporter_country\"].astype(\"string\") + t[mask_imports][\"partner_country\"].astype(\"string\") + t[mask_imports][\"item\"].astype(\"string\")\n", "t.loc[~mask_imports, \"id\"] = t[~mask_imports][\"partner_country\"].astype(\"string\") + t[~mask_imports][\"reporter_country\"].astype(\"string\") + t[~mask_imports][\"item\"].astype(\"string\")\n", "assert t[\"id\"].notnull().all()\n", "# Select item flows reported by both countries in 2023.\n", "t = t[t.groupby(\"id\")[\"id\"].transform(\"count\")==2].reset_index(drop=True)\n", "# Add columns for the minimum and maximum values of each item flow.\n", "t[\"value_min\"] = t.groupby(\"id\")[\"value\"].transform(\"min\")\n", "t[\"value_max\"] = t.groupby(\"id\")[\"value\"].transform(\"max\")\n", "# Define the quantity agreement ratio.\n", "t[\"ratio\"] = t[\"value_min\"] / t[\"value_max\"]\n", "# Percentage of item flows reported by Brazil that have an agreement ratio >=90%:\n", "mask_country = t[\"reporter_country\"] == \"Brazil\"\n", "# print(f'<25%: {100 * len(t[(mask_country) & (t[\"ratio\"]<0.25)]) / len(t[mask_country]):.0f}')\n", "# print(f'25%-50%: {100 * len(t[(mask_country) & (t[\"ratio\"]>=0.25) & (t[\"ratio\"]<0.50)]) / len(t[(mask_country)]):.0f}')\n", "# print(f'50%-75%: {100 * len(t[(mask_country) & (t[\"ratio\"]>=0.50) & (t[\"ratio\"]<0.75)]) / len(t[(mask_country)]):.0f}')\n", "# print(f'75%-90%: {100 * len(t[(mask_country) & (t[\"ratio\"]>=0.75) & (t[\"ratio\"]<0.90)]) / len(t[(mask_country)]):.0f}')\n", "# print(f'>=90%: {100 * len(t[(mask_country) & (t[\"ratio\"]>=0.90)]) / len(t[(mask_country)]):.0f}')" ] }, { "cell_type": "markdown", "id": "3fc27109", "metadata": {}, "source": [ "The following chart is identical to the previous, but it includes flows reported only one-way, in a dedicated \"Unmatched\" band. \n", "This way we can visualize both reporting coverage and quantity-agreement issues in one visualization." ] }, { "cell_type": "code", "execution_count": 14, "id": "65a97494", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:14.121671Z", "iopub.status.busy": "2026-06-18T15:26:14.121581Z", "iopub.status.idle": "2026-06-18T15:26:16.066671Z", "shell.execute_reply": "2026-06-18T15:26:16.066204Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_quantity_mismatch_by_reporter(tb, year=2023, top_n=30, include_unmatched=True)" ] }, { "cell_type": "markdown", "id": "25674a23", "metadata": {}, "source": [ "The following chart shows the overall picture of agreement ratios across all countries.\n", "We see that only 14% of all reported trade flows in 2023 are reported both ways and with reasonably similar quantities (with agreement ratio above 75%)." ] }, { "cell_type": "code", "execution_count": 15, "id": "3919e197", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:16.068407Z", "iopub.status.busy": "2026-06-18T15:26:16.068323Z", "iopub.status.idle": "2026-06-18T15:26:17.669465Z", "shell.execute_reply": "2026-06-18T15:26:17.669056Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_quantity_mismatch_distribution(tb, year=None, include_unmatched=True, item_codes=None, countries=None):\n", " \"\"\"Histogram showing what share of bilateral item flows falls into each\n", " quantity-agreement band for the given year.\n", "\n", " Each flow is counted once (unlike `plot_quantity_mismatch_by_reporter`,\n", " which aggregates by the country that filed each report and therefore\n", " counts each matched flow twice — once per side).\n", "\n", " When include_unmatched=True, flows reported by only one side land in a\n", " sixth Unmatched band; all six shares sum to 100% of the year's flows.\n", " When include_unmatched=False, only matched flows are included and the\n", " five matched shares sum to 100% of matched flows.\n", "\n", " Optional filters:\n", " - item_codes: iterable of FAO item codes; restricts to flows whose item\n", " is in the list. Useful for restricting to a curated viz subset.\n", " - countries: iterable of OWID-harmonized country names; restricts to\n", " flows where at least one side (exporter or importer) is in the list.\n", " Use to inspect data quality for a region/grouping of interest.\"\"\"\n", " def _title(year, item_codes, countries):\n", " filters = []\n", " if item_codes is not None:\n", " filters.append(f\"{len(set(item_codes))} items\")\n", " if countries is not None:\n", " filters.append(f\"{len(set(countries))} countries\")\n", " tail = f\" ({', '.join(filters)})\" if filters else \"\"\n", " return f\"Share of item flows by quantity agreement band in {year}{tail}\"\n", "\n", " if year is None:\n", " rows_per_year = tb.groupby(\"year\", observed=True).size()\n", " year = int(rows_per_year[rows_per_year >= 0.9 * rows_per_year.max()].index.max())\n", "\n", " sub = tb[(tb[\"year\"] == year) & tb[\"element\"].isin([\"Export quantity\", \"Import quantity\"])][\n", " [\"reporter_country\", \"partner_country\", \"item_code\", \"element\", \"value\"]\n", " ].copy()\n", " if item_codes is not None:\n", " sub = sub[sub[\"item_code\"].isin(set(item_codes))]\n", " for col in (\"reporter_country\", \"partner_country\"):\n", " sub[col] = sub[col].astype(str)\n", " if countries is not None:\n", " country_set = set(countries)\n", " sub = sub[sub[\"reporter_country\"].isin(country_set) | sub[\"partner_country\"].isin(country_set)]\n", "\n", " is_imp = (sub[\"element\"] == \"Import quantity\").to_numpy()\n", " rep = sub[\"reporter_country\"].to_numpy()\n", " par = sub[\"partner_country\"].to_numpy()\n", " exporter = np.where(is_imp, par, rep)\n", " importer = np.where(is_imp, rep, par)\n", " item_code = sub[\"item_code\"].astype(str).to_numpy()\n", " sub[\"flow_key\"] = [f\"{e}|{i}|{c}\" for e, i, c in zip(exporter, importer, item_code)]\n", "\n", " side_counts = sub.groupby(\"flow_key\").size()\n", " matched = (\n", " sub[sub[\"flow_key\"].isin(side_counts[side_counts == 2].index)]\n", " .groupby(\"flow_key\")[\"value\"]\n", " .agg([\"min\", \"max\"])\n", " )\n", " with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n", " matched[\"agreement\"] = np.where(matched[\"max\"] > 0, matched[\"min\"] / matched[\"max\"], 0.0)\n", "\n", " matched_band_labels = [\"<25%\", \"25-50%\", \"50-75%\", \"75-90%\", \">=90%\"]\n", " matched_band_colors = [\"#d73027\", \"#fc8d59\", \"#fee08b\", \"#91cf60\", \"#1a9850\"]\n", " matched[\"band\"] = pd.cut(\n", " matched[\"agreement\"],\n", " bins=[-0.001, 0.25, 0.50, 0.75, 0.90, 1.001],\n", " labels=matched_band_labels,\n", " )\n", " band_counts = matched[\"band\"].value_counts().reindex(matched_band_labels, fill_value=0)\n", "\n", " if include_unmatched:\n", " unmatched_count = int((side_counts == 1).sum())\n", " band_labels = [\"Unmatched\"] + matched_band_labels\n", " band_colors = [\"#67000d\"] + matched_band_colors\n", " counts = [unmatched_count] + band_counts.tolist()\n", " else:\n", " band_labels = matched_band_labels\n", " band_colors = matched_band_colors\n", " counts = band_counts.tolist()\n", "\n", " total = sum(counts)\n", " shares = [c / total for c in counts]\n", "\n", " fig = px.bar(\n", " x=band_labels,\n", " y=shares,\n", " color=band_labels,\n", " color_discrete_sequence=band_colors,\n", " category_orders={\"x\": band_labels, \"color\": band_labels},\n", " title=_title(year, item_codes, countries),\n", " labels={\"x\": \"Agreement band\", \"y\": \"Share of flows\"},\n", " text=[f\"{s:.1%}
({int(n):,} flows)\" for s, n in zip(shares, counts)],\n", " )\n", " fig.update_traces(textposition=\"outside\")\n", " fig.update_layout(showlegend=False, yaxis_tickformat=\".0%\")\n", " fig.show()\n", "\n", "\n", "plot_quantity_mismatch_distribution(tb, year=2023, include_unmatched=True)\n" ] }, { "cell_type": "markdown", "id": "4d7f5e32", "metadata": {}, "source": [ "The chart above mixes all 500+ items in the TM dataset. The curated list of items we actually surface in the visualization (see `food_trade.items.yaml`) is much smaller and concentrates on staples.\n", "\n", "However, the picture doesn't change significantly when selecting only items in the curated subset." ] }, { "cell_type": "code", "execution_count": 16, "id": "839f888c", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:17.670814Z", "iopub.status.busy": "2026-06-18T15:26:17.670747Z", "iopub.status.idle": "2026-06-18T15:26:17.672574Z", "shell.execute_reply": "2026-06-18T15:26:17.672122Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "curated_codes = set(curated_code_to_display)\n", "\n", "# plot_quantity_mismatch_distribution(tb, year=2023, include_unmatched=True, item_codes=curated_codes)" ] }, { "cell_type": "markdown", "id": "a7b7ed5f", "metadata": {}, "source": [ "## Estimating production and supply\n", "Trade figures are more informative in relative terms:\n", "\n", "* exports as a share of what a country produces, and\n", "* imports as a share of what a country consumes domestically.\n", "\n", "In other words, we would ideally show exports as a share of production, and imports as a share of domestic supply.\n", "\n", "However, production and domestic supply are not available in the TM dataset.\n", "So we need to combine this with other FAOSTAT datasets.\n", "\n", "FAOSTAT's Supply Utilization Accounts (SCL) dataset contains production quantity (and trade flows and stock variation) of each item, for each country and year.\n", "Thankfully, SCL and TM datasets use the same item codes (a number that uniquely identifies a FAOSTAT item).\n", "So it is in principle possible to combine both datasets to have production data in our food trade visualizations.\n", "\n", "### Why SCL and not QCL?\n", "\n", "FAOSTAT has two candidate datasets for production data:\n", "\n", "* **QCL (Crops and livestock products)** is FAOSTAT's headline production dataset, but it only contains production-side elements (production, area harvested, yield, animal numbers). It has no imports, exports, or commodity stock variation, so domestic supply cannot be computed from it at all.\n", "* **SCL (Supply Utilization Accounts)** contains production, imports, exports and stock variation (plus utilization elements such as food, feed and processing) for each item, country and year, compiled within one consistent accounting framework. That is everything the Food Balance Sheet identity needs.\n", "\n", "Item coverage is the second, independent reason. A large share of world food trade is in processed or semi-processed forms, and QCL mostly tracks primary production. For 2023:\n", "\n", "* QCL covers 43% of the items traded in TM, accounting for 63% of traded tonnage.\n", "* SCL covers 78% of the items, accounting for 91% of traded tonnage, including the largest traded processed commodities that QCL lacks: cake of soya beans (~128 Mt traded), milled rice (~72 Mt), refined sugar (~46 Mt), wheat flour, and the other oilseed cakes.\n", "\n", "The main price we pay for using SCL is country coverage: as shown in the consistency subsection below, SCL omits ten countries entirely (including Japan and Singapore), which QCL does cover." ] }, { "cell_type": "code", "execution_count": 17, "id": "5f35fbec", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:17.673736Z", "iopub.status.busy": "2026-06-18T15:26:17.673668Z", "iopub.status.idle": "2026-06-18T15:26:27.131159Z", "shell.execute_reply": "2026-06-18T15:26:27.130649Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "QCL: covers 43% of TM items, 63% of TM traded tonnage\n", "SCL: covers 78% of TM items, 91% of TM traded tonnage\n", " Cake of soya beans: 128 Mt traded\n", " Rice, milled: 72 Mt traded\n", " Other non-alcoholic caloric beverages: 54 Mt traded\n", " Refined sugar: 46 Mt traded\n", " Food preparations n.e.c.: 45 Mt traded\n", " Cake of rapeseed: 29 Mt traded\n", " Pastry: 25 Mt traded\n", " Wheat and meslin flour: 23 Mt traded\n" ] } ], "source": [ "# Double-check the QCL vs SCL coverage numbers quoted above.\n", "tm_year = tb[tb[\"year\"] == YEAR].copy()\n", "tm_year[\"item_code\"] = tm_year[\"item_code\"].astype(int)\n", "tm_tonnage_by_item = pd.Series(tm_year.groupby(\"item_code\", observed=True)[\"value\"].sum())\n", "item_codes_tm = set(tm_tonnage_by_item.index)\n", "\n", "item_codes_scl = set(tb_scl[tb_scl[\"year\"] == YEAR][\"item_code\"].astype(str).astype(int))\n", "\n", "tb_qcl = load_owid_table(\"garden/faostat/2026-02-25/faostat_qcl/faostat_qcl\")\n", "# QCL has no trade elements: production-side only (its \"Stocks\" element is live-animal headcounts).\n", "assert not {\"Import quantity\", \"Export quantity\", \"Stock Variation\"} & set(tb_qcl[\"element\"].astype(str).unique())\n", "item_codes_qcl = set(tb_qcl[(tb_qcl[\"year\"] == YEAR) & (tb_qcl[\"element\"].astype(str) == \"Production\")][\"item_code\"].astype(str).astype(int))\n", "\n", "total_tonnage = tm_tonnage_by_item.sum()\n", "for name, item_codes in [(\"QCL\", item_codes_qcl), (\"SCL\", item_codes_scl)]:\n", " share_items = len(item_codes_tm & item_codes) / len(item_codes_tm)\n", " share_tonnage = tm_tonnage_by_item[tm_tonnage_by_item.index.isin(item_codes)].sum() / total_tonnage\n", " print(f\"{name}: covers {share_items:.0%} of TM items, {share_tonnage:.0%} of TM traded tonnage\")\n", "\n", "assert 0.55 < tm_tonnage_by_item[tm_tonnage_by_item.index.isin(item_codes_qcl)].sum() / total_tonnage < 0.70\n", "assert 0.85 < tm_tonnage_by_item[tm_tonnage_by_item.index.isin(item_codes_scl)].sum() / total_tonnage < 0.95\n", "\n", "# Largest traded items covered by SCL but not by QCL.\n", "only_scl = (item_codes_tm & item_codes_scl) - item_codes_qcl\n", "item_names = pd.DataFrame(tm_year[[\"item_code\", \"item\"]].drop_duplicates()).astype({\"item\": str}).set_index(\"item_code\")[\"item\"]\n", "top_missing = tm_tonnage_by_item[tm_tonnage_by_item.index.isin(only_scl)].nlargest(8)\n", "for code, tonnage in top_missing.items():\n", " print(f\" {item_names.get(code, code)}: {tonnage / 1e6:,.0f} Mt traded\")\n" ] }, { "cell_type": "markdown", "id": "363213bf", "metadata": {}, "source": [ "Domestic supply is not directly available in the TM or SCL datasets, but it can be computed from SCL via the FAOSTAT Food Balance Sheet identity:\n", "\n", "`Production + Imports − Exports − Stock Variation`\n", "\n", "(Stock Variation in SCL is defined as closing minus opening stocks, so stock accumulation reduces the supply available for domestic use.)\n", "\n", "This computation is possible for those traded items for which SCL has data.\n", "\n", "Instead of estimating domestic supply, we could get this data from FAOSTAT's Food Balances dataset (FBS), which has figures for production, domestic supply, imports, and exports.\n", "However, mapping FBS items to QCL or TM items is not straightforward.\n", "\n", "For example, FBS' item \"Cocoa Beans and products\" is a group of various items.\n", "As its metadata says:\n", "`Default composition: 661 Cocoa, beans, 662 Cocoa, paste, 665 Cocoa, powder & cake, 666 Chocolate products nes`\n", "\n", "Properly disaggregating FBS commodities is not trivial (see e.g. [Zhao et al. (2025)](https://www.nature.com/articles/s41597-025-05137-y)).\n", "Hence, for the scope of this project, we decided to estimate supply using SCL data." ] }, { "cell_type": "markdown", "id": "65617a09", "metadata": {}, "source": [ "### Consistency between TM and SCL trade reports\n", "\n", "Our visualization mixes the two datasets: trade flows come from TM, while production and domestic supply come from SCL.\n", "Before combining them, we should check whether their trade accounts are actually consistent: both datasets report total imports and exports of each item for each country, so they can be compared directly.\n", "\n", "The cleanest comparison is between SCL and each country's *own* TM reports (the sum over all partners of what the country itself declared), since that isolates source consistency from the bilateral reconciliation issues discussed earlier (flows reported only by the partner are examined separately below).\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "862d6ca1", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:27.132327Z", "iopub.status.busy": "2026-06-18T15:26:27.132265Z", "iopub.status.idle": "2026-06-18T15:26:29.395043Z", "shell.execute_reply": "2026-06-18T15:26:29.394559Z" }, "tags": [ "collapsed" ] }, "outputs": [], "source": [ "# Build the comparison: own-reported TM totals vs SCL, per (country, item), for YEAR.\n", "tm_own = (\n", " tb[(tb[\"year\"] == YEAR) & tb[\"element\"].isin([\"Import quantity\", \"Export quantity\"])]\n", " .groupby([\"reporter_country\", \"item_code\", \"element\"], observed=True)[\"value\"]\n", " .sum()\n", " .unstack(\"element\")\n", " .reset_index()\n", " .rename(columns={\"reporter_country\": \"country\", \"Import quantity\": \"tm_imports\", \"Export quantity\": \"tm_exports\"})\n", ")\n", "tm_own = pd.DataFrame(tm_own)\n", "tm_own[\"country\"] = tm_own[\"country\"].astype(str)\n", "tm_own[\"item_code\"] = tm_own[\"item_code\"].astype(int)\n", "\n", "scl_trade = tb_scl[\n", " (tb_scl[\"year\"] == YEAR)\n", " & (tb_scl[\"unit_short_name\"] == \"t\")\n", " & tb_scl[\"element\"].astype(str).isin([\"Import quantity\", \"Export quantity\"])\n", "].copy()\n", "scl_trade[\"element\"] = scl_trade[\"element\"].astype(str)\n", "scl_trade[\"country\"] = scl_trade[\"country\"].astype(str)\n", "scl_trade[\"item_code\"] = scl_trade[\"item_code\"].astype(str).astype(int)\n", "scl_trade = pd.DataFrame(\n", " scl_trade.groupby([\"country\", \"item_code\", \"element\"], observed=True)[\"value\"].sum().unstack(\"element\").reset_index()\n", ").rename(columns={\"Import quantity\": \"scl_imports\", \"Export quantity\": \"scl_exports\"})\n", "\n", "trade = tm_own.merge(scl_trade, on=[\"country\", \"item_code\"], how=\"outer\")\n", "# Restrict to individual countries (drop FAO region aggregates).\n", "trade = trade[~trade[\"country\"].str.contains(r\"\\(FAO\\)|World|countries|Union\", case=False, regex=True)].reset_index(drop=True)\n" ] }, { "cell_type": "markdown", "id": "af3f716f", "metadata": {}, "source": [ "#### Where both datasets report, they agree closely\n", "\n", "Using the same agreement ratio as before (`min / max`, so 100% is a perfect match), the two datasets agree within ~10% for the vast majority of trade tonnage: about 94% of import tonnage and 97% of export tonnage where both report a positive figure.\n", "In other words, SCL's trade elements are essentially the countries' own customs reports; the two sources are not at odds.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "584ceed8", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:29.396363Z", "iopub.status.busy": "2026-06-18T15:26:29.396300Z", "iopub.status.idle": "2026-06-18T15:26:29.430686Z", "shell.execute_reply": "2026-06-18T15:26:29.430303Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "imports: both report >0 for 38,362 (country, item) pairs\n", " agreement >= 90%: 86.2% of pairs, 94.3% of tonnage\n", " agreement >= 75%: 89.2% of pairs, 97.0% of tonnage\n", " agreement >= 50%: 92.4% of pairs, 99.0% of tonnage\n", "exports: both report >0 for 26,474 (country, item) pairs\n", " agreement >= 90%: 88.5% of pairs, 96.9% of tonnage\n", " agreement >= 75%: 91.1% of pairs, 98.2% of tonnage\n", " agreement >= 50%: 93.7% of pairs, 99.0% of tonnage\n" ] } ], "source": [ "def tm_scl_agreement(trade, side):\n", " t, s = trade[f\"tm_{side}\"], trade[f\"scl_{side}\"]\n", " both = t.notna() & s.notna() & (t > 0) & (s > 0)\n", " pair = pd.concat([t[both], s[both]], axis=1)\n", " ratio = pair.min(axis=1) / pair.max(axis=1)\n", " weight = pair.max(axis=1)\n", " return both, ratio, weight\n", "\n", "for side in [\"imports\", \"exports\"]:\n", " both, ratio, weight = tm_scl_agreement(trade, side)\n", " print(f\"{side}: both report >0 for {both.sum():,} (country, item) pairs\")\n", " for threshold in [0.9, 0.75, 0.5]:\n", " share_pairs = (ratio >= threshold).mean()\n", " share_tonnage = weight[ratio >= threshold].sum() / weight.sum()\n", " print(f\" agreement >= {threshold:.0%}: {share_pairs:.1%} of pairs, {share_tonnage:.1%} of tonnage\")\n", "\n", "# Double-check the claims quoted above.\n", "_, ratio_i, weight_i = tm_scl_agreement(trade, \"imports\")\n", "_, ratio_e, weight_e = tm_scl_agreement(trade, \"exports\")\n", "assert 0.92 < weight_i[ratio_i >= 0.9].sum() / weight_i.sum() < 0.97\n", "assert 0.95 < weight_e[ratio_e >= 0.9].sum() / weight_e.sum() < 0.99\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "72a1a228", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:29.431953Z", "iopub.status.busy": "2026-06-18T15:26:29.431893Z", "iopub.status.idle": "2026-06-18T15:26:29.473117Z", "shell.execute_reply": "2026-06-18T15:26:29.472617Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visual check: own-reported TM imports vs SCL imports per (country, item), log-log.\n", "sample = trade[(trade[\"tm_imports\"] > 0) & (trade[\"scl_imports\"] > 0)].sample(5000, random_state=1)\n", "fig = px.scatter(\n", " sample,\n", " x=\"tm_imports\",\n", " y=\"scl_imports\",\n", " hover_data=[\"country\", \"item_code\"],\n", " log_x=True,\n", " log_y=True,\n", " opacity=0.3,\n", " title=f\"Imports per (country, item) in {YEAR}: own TM reports vs SCL (sample of 5,000 pairs)\",\n", ")\n", "fig.add_trace(go.Scatter(x=[1e-2, 1e8], y=[1e-2, 1e8], mode=\"lines\", line=dict(dash=\"dash\", color=\"gray\"), name=\"y = x\"))\n", "fig.show()\n" ] }, { "cell_type": "markdown", "id": "d689a211", "metadata": {}, "source": [ "#### Where they disagree, it is almost always a blank, not a contradiction\n", "\n", "TM trade tonnage that has no SCL counterpart decomposes into three very different cases:\n", "\n", "1. **The item is not tracked by SCL at all** (~9% of own-reported tonnage on each side). SCL covers a subset of TM's 500+ items.\n", "2. **The country is not tracked by SCL at all.** Ten TM-reporting countries are entirely absent from SCL, including Japan and Singapore, two of the world's largest importers. Japan alone has ~15 million tonnes of own-reported maize imports with no SCL record.\n", "3. **The (country, item) pair is tracked but the element is blank.** This case is negligible: ~0.2 Mt of imports and ~0.9 Mt of exports in total. When a country files customs reports, SCL virtually always includes them.\n", "\n", "The converse also happens: SCL reports positive trade for thousands of (country, item) pairs where the country filed nothing to TM. SCL draws on partner (mirror) data and estimates to cover countries that do not report customs data themselves.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "127b58f9", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:29.474324Z", "iopub.status.busy": "2026-06-18T15:26:29.474262Z", "iopub.status.idle": "2026-06-18T15:26:29.510756Z", "shell.execute_reply": "2026-06-18T15:26:29.510323Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "imports: total own-reported TM tonnage 1,583 Mt\n", " item not in SCL: 5908 pairs, 135.9 Mt (8.6%)\n", " country not in SCL: 2162 pairs, 61.0 Mt (3.9%)\n", " tracked pair, blank value: 809 pairs, 0.2 Mt (0.0%)\n", "exports: total own-reported TM tonnage 1,685 Mt\n", " item not in SCL: 3994 pairs, 155.3 Mt (9.2%)\n", " country not in SCL: 1074 pairs, 5.2 Mt (0.3%)\n", " tracked pair, blank value: 843 pairs, 0.9 Mt (0.1%)\n", "\n", "TM reporters absent from SCL: ['Benin', 'Brunei', 'Burundi', 'Central African Republic', 'Dominica', 'Japan', 'Mali', 'Palestine', 'Singapore', 'Togo']\n" ] } ], "source": [ "scl_items = set(scl_trade[\"item_code\"])\n", "scl_countries = set(scl_trade[\"country\"])\n", "\n", "for side in [\"imports\", \"exports\"]:\n", " t, s = trade[f\"tm_{side}\"], trade[f\"scl_{side}\"]\n", " blank = (t > 0) & s.isna()\n", " item_untracked = blank & ~trade[\"item_code\"].isin(scl_items)\n", " country_untracked = blank & trade[\"item_code\"].isin(scl_items) & ~trade[\"country\"].isin(scl_countries)\n", " element_blank = blank & trade[\"item_code\"].isin(scl_items) & trade[\"country\"].isin(scl_countries)\n", " total = t.fillna(0).sum()\n", " print(f\"{side}: total own-reported TM tonnage {total / 1e6:,.0f} Mt\")\n", " print(f\" item not in SCL: {item_untracked.sum():>5} pairs, {t[item_untracked].sum() / 1e6:>6.1f} Mt ({t[item_untracked].sum() / total:.1%})\")\n", " print(f\" country not in SCL: {country_untracked.sum():>5} pairs, {t[country_untracked].sum() / 1e6:>6.1f} Mt ({t[country_untracked].sum() / total:.1%})\")\n", " print(f\" tracked pair, blank value: {element_blank.sum():>5} pairs, {t[element_blank].sum() / 1e6:>6.1f} Mt ({t[element_blank].sum() / total:.1%})\")\n", "\n", "# TM reporters entirely absent from SCL.\n", "tm_reporters = set(trade.loc[trade[\"tm_imports\"].notna() | trade[\"tm_exports\"].notna(), \"country\"])\n", "absent = sorted(tm_reporters - scl_countries)\n", "print(f\"\\nTM reporters absent from SCL: {absent}\")\n", "assert \"Japan\" in absent and \"Singapore\" in absent and len(absent) == 10\n", "\n", "# Tracked-pair blanks are negligible on both sides.\n", "for side in [\"imports\", \"exports\"]:\n", " t, s = trade[f\"tm_{side}\"], trade[f\"scl_{side}\"]\n", " element_blank = (t > 0) & s.isna() & trade[\"item_code\"].isin(scl_items) & trade[\"country\"].isin(scl_countries)\n", " assert t[element_blank].sum() / t.fillna(0).sum() < 0.005\n" ] }, { "cell_type": "markdown", "id": "1f6008d2", "metadata": {}, "source": [ "#### The corollary: mirror-only flows have no SCL counterpart\n", "\n", "Our trade-flow table prefers the importer-reported quantity and falls back to the exporter-reported quantity when the importer is silent.\n", "Those fallback flows are precisely the trade that the importing country never declared anywhere, so SCL (which mirrors the country's own reports) has no record of them either.\n", "\n", "A concrete example: Mexico reported zero barley imports of its own in 2023, in both TM and SCL. Its partners, however, declared ~518 thousand tonnes of barley exports to Mexico. Any quantity we display for that flow exists only in mirror data.\n", "\n", "In the final table, about a quarter of flows (but only ~9% of tonnage) are mirror-only.\n", "This matters when relating TM flows to SCL-derived figures (such as domestic supply): for mirror-only flows, the SCL side of the comparison is structurally blank, and treating that blank as zero would contradict the very flow being displayed.\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "ef59556d", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:29.511869Z", "iopub.status.busy": "2026-06-18T15:26:29.511809Z", "iopub.status.idle": "2026-06-18T15:26:37.314874Z", "shell.execute_reply": "2026-06-18T15:26:37.314428Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mexico barley 2023: own-reported imports = 0 t; partners' declared exports to Mexico = 518,218 t\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "food_trade table: mirror-only flows: 25.7% of rows, 10.3% of tonnage\n" ] } ], "source": [ "# Mexico barley (item_code 44): own reports vs mirror reports.\n", "mexico_own = tb[\n", " (tb[\"year\"] == YEAR)\n", " & (tb[\"reporter_country\"].astype(str) == \"Mexico\")\n", " & (tb[\"item_code\"].astype(int) == 44)\n", " & (tb[\"element\"] == \"Import quantity\")\n", "][\"value\"].sum()\n", "mexico_mirror = tb[\n", " (tb[\"year\"] == YEAR)\n", " & (tb[\"partner_country\"].astype(str) == \"Mexico\")\n", " & (tb[\"item_code\"].astype(int) == 44)\n", " & (tb[\"element\"] == \"Export quantity\")\n", "][\"value\"].sum()\n", "print(f\"Mexico barley {YEAR}: own-reported imports = {mexico_own:,.0f} t; partners' declared exports to Mexico = {mexico_mirror:,.0f} t\")\n", "assert mexico_own == 0 and mexico_mirror > 400_000\n", "\n", "# Mirror-only share of the food_trade table (the dataset produced by our garden step).\n", "tb_ft = tb_food_trade\n", "\n", "importer_reported = tb[\n", " (tb[\"year\"] == YEAR) & (tb[\"element\"] == \"Import quantity\") & (tb[\"value\"] > 0)\n", "].copy()\n", "importer_reported[\"importer\"] = importer_reported[\"reporter_country\"].astype(str)\n", "importer_reported[\"exporter\"] = importer_reported[\"partner_country\"].astype(str)\n", "reported_keys = set(zip(importer_reported[\"importer\"], importer_reported[\"exporter\"], importer_reported[\"item_code\"].astype(int)))\n", "\n", "# A display item may combine several FAO codes (see COMBINED_ITEMS), and the food_trade table\n", "# carries only the synthetic id for those, so map each item back to its underlying FAO codes. A\n", "# flow counts as importer-reported when any of the item's codes is reported for that (importer,\n", "# exporter) pair.\n", "display_to_codes = {}\n", "for code, item in curated_code_to_display.items():\n", " display_to_codes.setdefault(item, set()).add(code)\n", "is_mirror = pd.Series(\n", " [\n", " not any((importer, exporter, c) in reported_keys for c in display_to_codes[item])\n", " for importer, exporter, item in zip(\n", " tb_ft[\"importer\"].astype(str), tb_ft[\"exporter\"].astype(str), tb_ft[\"item\"].astype(str)\n", " )\n", " ],\n", " index=tb_ft.index,\n", ")\n", "share_flows = is_mirror.mean()\n", "share_tonnage = tb_ft.loc[is_mirror, \"value\"].sum() / tb_ft[\"value\"].sum()\n", "print(f\"food_trade table: mirror-only flows: {share_flows:.1%} of rows, {share_tonnage:.1%} of tonnage\")\n", "assert 0.15 < share_flows < 0.35 and 0.05 < share_tonnage < 0.15\n" ] }, { "cell_type": "markdown", "id": "dc078a7a", "metadata": {}, "source": [ "#### Conclusions\n", "\n", "* TM and SCL are **not** at odds. Where both report a figure, they agree within ~10% for 94-97% of tonnage. SCL's trade elements are essentially the countries' own customs reports, so combining TM flows with SCL-derived quantities is methodologically sound.\n", "* SCL's gaps are structural, not random noise: items it does not track (~9% of tonnage), ten countries it does not cover at all (including Japan and Singapore), and essentially nothing else. For a tracked (country, item) pair, a blank trade element means the country itself reported nothing anywhere, not that SCL dropped data TM has.\n", "* The one systematic blind spot is **mirror-only trade**: flows declared only by the partner country (~9% of tonnage in our final table). These are real flows with no counterpart on the importer's books, in either dataset. Any computation that fills SCL blanks with zeros (for example, the domestic supply identity) is unreliable exactly where mirror data contradicts the blank, and trustworthy where it does not. This gives a principled, data-backed criterion for when SCL-derived supply can be published alongside TM flows.\n" ] }, { "cell_type": "markdown", "id": "11e30849", "metadata": {}, "source": [ "### Which trade report do we trust, and which does SCL use?\n", "\n", "Most shipments are logged twice, once by the importing country and once by the exporting country, and as we saw above the two numbers often disagree. Since the visualization shows a single figure per flow, we have to choose which to believe.\n", "\n", "The data alone does not settle it. Most tonnage is logged by both sides, and when only one side logs a flow it is split fairly evenly between importers and exporters, so neither is the more complete record-keeper. And when both log a flow, they disagree surprisingly often, with no systematic tilt towards one side being larger." ] }, { "cell_type": "code", "execution_count": 23, "id": "f5cd2437", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:37.316264Z", "iopub.status.busy": "2026-06-18T15:26:37.316187Z", "iopub.status.idle": "2026-06-18T15:26:37.783103Z", "shell.execute_reply": "2026-06-18T15:26:37.782670Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "agree within 10%: 20% of flows | importer higher: 47% | exporter higher: 49%\n" ] } ], "source": [ "# Build the bilateral flow table for the chosen year: importer's report vs exporter's report per flow.\n", "fy = tb[tb[\"year\"] == YEAR]\n", "exp_f = pd.DataFrame(\n", " fy[fy[\"element\"] == \"Export quantity\"][[\"reporter_country\", \"partner_country\", \"item_code\", \"value\"]]\n", ").rename(columns={\"reporter_country\": \"exporter\", \"partner_country\": \"importer\", \"value\": \"v_exp\"})\n", "imp_f = pd.DataFrame(\n", " fy[fy[\"element\"] == \"Import quantity\"][[\"reporter_country\", \"partner_country\", \"item_code\", \"value\"]]\n", ").rename(columns={\"reporter_country\": \"importer\", \"partner_country\": \"exporter\", \"value\": \"v_imp\"})\n", "flows = exp_f.merge(imp_f, on=[\"exporter\", \"importer\", \"item_code\"], how=\"outer\")\n", "v_exp = np.nan_to_num(flows[\"v_exp\"].to_numpy(float))\n", "v_imp = np.nan_to_num(flows[\"v_imp\"].to_numpy(float))\n", "size = np.maximum(v_exp, v_imp)\n", "both = (v_exp > 0) & (v_imp > 0)\n", "\n", "# Who logs each shipment, by tonnage.\n", "who = pd.DataFrame(\n", " {\n", " \"reported by\": [\"both sides\", \"only the importer\", \"only the exporter\"],\n", " \"share of trade tonnage\": [\n", " size[both].sum() / size.sum(),\n", " size[(v_imp > 0) & (v_exp == 0)].sum() / size.sum(),\n", " size[(v_exp > 0) & (v_imp == 0)].sum() / size.sum(),\n", " ],\n", " }\n", ")\n", "px.bar(who, x=\"reported by\", y=\"share of trade tonnage\", title=f\"Who logs each food shipment? ({YEAR})\").show()\n", "\n", "# When both log it, how far apart are the two numbers?\n", "ratio = v_imp[both] / v_exp[both]\n", "fig = px.histogram(\n", " x=np.log10(ratio),\n", " nbins=60,\n", " title=\"When both sides log a flow, how much do they disagree?\",\n", " labels={\"x\": \"importer-reported / exporter-reported\"},\n", ")\n", "fig.update_xaxes(tickvals=[-2, -1, 0, 1, 2], ticktext=[\"1/100\", \"1/10\", \"equal\", \"10x\", \"100x\"])\n", "fig.show()\n", "print(f\"agree within 10%: {(np.minimum(v_imp[both], v_exp[both]) / np.maximum(v_imp[both], v_exp[both]) >= 0.9).mean():.0%}\"\n", " f\" of flows | importer higher: {(v_imp[both] > v_exp[both]).mean():.0%} | exporter higher: {(v_exp[both] > v_imp[both]).mean():.0%}\")\n" ] }, { "cell_type": "markdown", "id": "fac001bf", "metadata": {}, "source": [ "So we lean on the importer, for two reasons.\n", "\n", "The first is data quality: customs authorities scrutinise incoming goods more closely than outgoing ones (duties, quarantine checks, quotas), so the importer's figure is usually the better measured. FAOSTAT says as much in the [Food Balance Sheets and Supply Utilization Accounts Resource Handbook 2025](https://openknowledge.fao.org/items/30e641b1-6d57-4998-8e8a-395727aa4307), §6.1:\n", "\n", "> \"[...] mirror data can also be used to reconcile values reported by the reporting country with the same trade flows reported by the trading partner. This is especially useful for exports, as imports are typically documented more thoroughly and verified more rigorously than exports (UNSD, 2013).\"\n", "\n", "This means SCL itself is built importer-first: it uses a country's own reports, and falls back to partner (\"mirror\") data only for non-reporting countries. We can confirm it from the data. In the (country, item) pairs where a country's own import report and its partners' export reports disagree by more than 25%, SCL's import figure sides with the importer's own number the overwhelming majority of the time." ] }, { "cell_type": "code", "execution_count": 24, "id": "4f66aede", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:37.784245Z", "iopub.status.busy": "2026-06-18T15:26:37.784176Z", "iopub.status.idle": "2026-06-18T15:26:39.761748Z", "shell.execute_reply": "2026-06-18T15:26:39.761418Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Does SCL's import figure follow the importer's own reports, or the exporters' (mirror) view?\n", "own_imp = pd.DataFrame(\n", " fy[fy[\"element\"] == \"Import quantity\"].groupby([\"reporter_country\", \"item_code\"], observed=True)[\"value\"].sum()\n", ").reset_index().rename(columns={\"reporter_country\": \"country\", \"value\": \"own\"})\n", "mirror_imp = pd.DataFrame(\n", " fy[fy[\"element\"] == \"Export quantity\"].groupby([\"partner_country\", \"item_code\"], observed=True)[\"value\"].sum()\n", ").reset_index().rename(columns={\"partner_country\": \"country\", \"value\": \"mirror\"})\n", "scl_imp = tb_scl[(tb_scl[\"year\"] == YEAR) & (tb_scl[\"unit_short_name\"] == \"t\") & (tb_scl[\"element\"].astype(str) == \"Import quantity\")].copy()\n", "scl_imp[\"country\"] = scl_imp[\"country\"].astype(str)\n", "scl_imp[\"item_code\"] = scl_imp[\"item_code\"].astype(str).astype(int)\n", "scl_imp = pd.DataFrame(scl_imp.groupby([\"country\", \"item_code\"], observed=True)[\"value\"].sum()).reset_index().rename(columns={\"value\": \"scl\"})\n", "\n", "cmp = own_imp.merge(mirror_imp, on=[\"country\", \"item_code\"]).merge(scl_imp, on=[\"country\", \"item_code\"])\n", "own = cmp[\"own\"].to_numpy(float); mir = cmp[\"mirror\"].to_numpy(float); scl = cmp[\"scl\"].to_numpy(float)\n", "ok = (own > 0) & (mir > 0) & (scl > 0)\n", "own, mir, scl = own[ok], mir[ok], scl[ok]\n", "disagree = np.maximum(own, mir) / np.minimum(own, mir) > 1.25\n", "closer_to_importer = np.abs(scl - own) < np.abs(scl - mir)\n", "w = np.maximum(own, mir)[disagree]\n", "ci = closer_to_importer[disagree]\n", "res = pd.DataFrame(\n", " {\n", " \"SCL's import figure is closer to\": [\"the importer's own report\", \"the exporters' (mirror) report\"],\n", " \"share of cases\": [ci.mean(), 1 - ci.mean()],\n", " \"share of tonnage\": [w[ci].sum() / w.sum(), w[~ci].sum() / w.sum()],\n", " }\n", ").melt(id_vars=\"SCL's import figure is closer to\", var_name=\"measured\", value_name=\"share\")\n", "px.bar(\n", " res, x=\"SCL's import figure is closer to\", y=\"share\", color=\"measured\", barmode=\"group\",\n", " title=f\"When importer and exporter disagree (>25%), whom does SCL believe? ({YEAR}, {disagree.sum():,} pairs)\",\n", ").show()\n" ] }, { "cell_type": "markdown", "id": "8276e95f", "metadata": {}, "source": [ "There is a neat consequence. Our domestic-supply figure is SCL's, and its import component is importer-sourced. So by drawing the flows importer-first as well, the imports we display are the same imports already baked into the supply we divide by. The numerator and denominator of \"imports as a share of supply\" speak the same language, rather than mixing two different counts of the same trade." ] }, { "cell_type": "markdown", "id": "53b02ef1", "metadata": {}, "source": [ "### Showing domestic supply only where the trade data backs it up\n", "\n", "SCL's import figures are usually complete, but not always. When a country under-reports its own imports, SCL inherits the gap, and the domestic supply we compute from it comes out too low, which would make imports look like an impossibly large share of supply.\n", "\n", "Mexico and barley in 2023 is a clean example. SCL records Mexican barley production but no imports, giving a domestic supply of about 0.9 million tonnes. Yet the trade matrix shows ~0.5 million tonnes of barley shipped into Mexico that year (its partners reported the exports; Mexico itself did not). Trust SCL's supply and you would tell readers imports are 58% of Mexico's barley supply, when the real figure is nearer 37%.\n", "\n", "We can catch this, because the inbound flows in the trade matrix are an independent read on each country's imports. So we keep the supply figure only where SCL's recorded imports broadly match the trade we observe flowing in, and blank it where they fall well short. As before, the trade data is used only to *decide whether to trust* SCL's supply, never to change it.\n", "\n", "How strict should \"broadly match\" be? The chart below shows how much supply coverage survives as we tighten the bar, applying the same test to imports, to exports, and to both." ] }, { "cell_type": "code", "execution_count": 25, "id": "12db2f4d", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:39.763171Z", "iopub.status.busy": "2026-06-18T15:26:39.763103Z", "iopub.status.idle": "2026-06-18T15:26:41.898764Z", "shell.execute_reply": "2026-06-18T15:26:41.898379Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Coverage of domestic supply that survives the trade cross-check, as a function of the threshold.\n", "flows[\"value\"] = flows[\"v_imp\"].fillna(flows[\"v_exp\"])\n", "fl = flows[flows[\"value\"] > 0]\n", "tm_in = pd.DataFrame(fl.groupby([\"importer\", \"item_code\"], observed=True)[\"value\"].sum()).reset_index().rename(columns={\"importer\": \"country\", \"value\": \"tm_in\"})\n", "tm_out = pd.DataFrame(fl.groupby([\"exporter\", \"item_code\"], observed=True)[\"value\"].sum()).reset_index().rename(columns={\"exporter\": \"country\", \"value\": \"tm_out\"})\n", "\n", "comp = [\"Production\", \"Import quantity\", \"Export quantity\", \"Stock Variation\"]\n", "sc = tb_scl[(tb_scl[\"year\"] == YEAR) & (tb_scl[\"unit_short_name\"] == \"t\") & tb_scl[\"element\"].astype(str).isin(comp)].copy()\n", "sc[\"country\"] = sc[\"country\"].astype(str); sc[\"item_code\"] = sc[\"item_code\"].astype(str).astype(int); sc[\"element\"] = sc[\"element\"].astype(str)\n", "sw = pd.DataFrame(sc.groupby([\"country\", \"item_code\", \"element\"], observed=True)[\"value\"].sum().unstack(\"element\").reset_index())\n", "sw[\"supply\"] = sw[\"Production\"].fillna(0) + sw[\"Import quantity\"].fillna(0) - sw[\"Export quantity\"].fillna(0) - sw[\"Stock Variation\"].fillna(0)\n", "sw = sw.merge(tm_in, on=[\"country\", \"item_code\"], how=\"left\").merge(tm_out, on=[\"country\", \"item_code\"], how=\"left\")\n", "sw[\"tm_in\"] = sw[\"tm_in\"].fillna(0.0); sw[\"tm_out\"] = sw[\"tm_out\"].fillna(0.0)\n", "sw = sw[sw[\"supply\"] > 0]\n", "si = sw[\"Import quantity\"].fillna(0).to_numpy(float); se = sw[\"Export quantity\"].fillna(0).to_numpy(float)\n", "tin = sw[\"tm_in\"].to_numpy(float); tout = sw[\"tm_out\"].to_numpy(float)\n", "\n", "thr = np.linspace(0, 1, 51)\n", "def kept(scl_flow, observed, t):\n", " return (observed == 0) | (scl_flow >= t * observed)\n", "rows = []\n", "for t in thr:\n", " ki = kept(si, tin, t); ke = kept(se, tout, t); kb = ki & ke\n", " for label, mask in [(\"imports\", ki), (\"exports\", ke), (\"both\", kb)]:\n", " rows.append({\"threshold\": t * 100, \"check\": label, \"tonnage kept\": tin[mask].sum() / tin.sum() * 100})\n", "cov = pd.DataFrame(rows)\n", "fig = px.line(cov, x=\"threshold\", y=\"tonnage kept\", color=\"check\",\n", " title=f\"Domestic-supply coverage surviving the trade cross-check ({YEAR}, all items)\",\n", " labels={\"threshold\": \"SCL flow must cover ≥ this % of observed trade\", \"tonnage kept\": \"% of inbound tonnage with supply\"})\n", "fig.add_vline(x=90, line_dash=\"dash\", line_color=\"gray\", annotation_text=\"chosen: 90%\")\n", "fig.update_yaxes(range=[0, 101])\n", "fig.show()\n" ] }, { "cell_type": "markdown", "id": "eda02d34", "metadata": {}, "source": [ "We set the bar at **90%**: SCL's imports must cover at least 90% of the observed inbound trade for that country and item, otherwise the supply is left blank. That is the natural cut. Where SCL and the trade matrix are both the importer's own data (well-reporting countries), they agree within about 10% anyway, so a larger gap signals a genuine hole rather than the normal CIF/FOB, timing and classification wobble. Tightening much beyond 90% starts discarding good supplies for that ordinary noise (the steep drop near 100% in the chart); loosening it lets through supplies that are understated by a median of ~18%. At 90%, blanking removes a small slice of inbound tonnage and leaves the well-reported bulk intact.\n", "\n", "We apply this test to **imports only**. Checking the export leg looks attractive but misfires, as the chart shows it is far more aggressive. The reason is that for a big net importer, exports are immaterial to supply: China imported ~103 million tonnes of soybeans in 2023 against domestic supply of ~121 million tonnes, while exporting just 71 thousand tonnes (0.06% of supply). A trivial mismatch in that tiny export figure would otherwise veto one of the most important food-supply numbers in the world. So the export check is left out, and supply is validated on the import leg, the one that actually moves the denominator." ] }, { "cell_type": "markdown", "id": "meatcut-intro", "metadata": {}, "source": [ "# Combining items\n", "\n", "FAOSTAT often reports a single commodity under more than one item code, splitting a primary form from a mechanically-derived one: meat with the bone vs. boneless, nuts in their shell vs. shelled kernels, paddy rice vs. milled rice, raw vs. refined sugar. International trade is dominated by the *derived* form (you ship deboned meat, shelled kernels, milled rice, refined sugar), so the primary code on its own captures only a fraction of what a reader would call \"the trade in that commodity\".\n", "\n", "The charts below show, grouped by commodity family, how each combined item's global trade splits across its FAO codes in 2023. The primary code (listed first) is frequently the minority: bone-in beef is about a fifth of beef trade, in-shell almonds under a third of almond trade, and the old single \"Rice\" code we previously used was a near-empty residual next to milled rice.\n", "\n", "So for these items we **sum the trade of all their codes** to recover the full bilateral flow. Summing the *trade* does not double-count, because the codes are distinct shipments under distinct customs headings. We do **not** sum their production or domestic supply, and we show none for these items: the derived form is processed *from* the primary (FAOSTAT records the in-shell almonds that are then shelled, the paddy that is then milled), so adding the two would count the same harvest twice, and they sit on incompatible weight bases in any case. In the published data each combined item carries a synthetic id of `100000 + its first code` (beef = `100867`), so the id is never mistaken for a single FAO commodity." ] }, { "cell_type": "code", "execution_count": 26, "id": "meatcut-chart", "metadata": { "execution": { "iopub.execute_input": "2026-06-18T15:26:41.900049Z", "iopub.status.busy": "2026-06-18T15:26:41.899971Z", "iopub.status.idle": "2026-06-18T15:26:42.080028Z", "shell.execute_reply": "2026-06-18T15:26:42.079554Z" }, "tags": [ "collapsed" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Group the combined items by commodity family so each chart shares one set of FAO forms.\n", "ITEM_GROUPS = {\n", " \"Meat (bone-in vs. boneless)\": [\"Beef\", \"Pork\"],\n", " \"Nuts (in-shell vs. shelled)\": [\"Almonds\", \"Cashews\", \"Walnuts\", \"Hazelnuts\", \"Brazil nuts\", \"Groundnuts\"],\n", " \"Rice (milled, broken, husked)\": [\"Rice\"],\n", " \"Sugar (raw vs. refined)\": [\"Sugar\"],\n", "}\n", "\n", "# Global trade per FAO code (2023): the larger of the two reported directions, summed to the world total.\n", "year_t = tb[(tb[\"year\"] == YEAR) & (tb[\"unit\"] == \"t\")]\n", "world = year_t[year_t[\"element\"].isin([\"Export quantity\", \"Import quantity\"])]\n", "by_code = world.groupby([\"item_code\", \"element\"], observed=True)[\"value\"].sum().unstack(\"element\").max(axis=1)\n", "\n", "\n", "def split_frame(items):\n", " rows = []\n", " for item in items:\n", " members = COMBINED_ITEMS[item]\n", " total = sum(float(by_code.get(code, 0.0)) for code in members)\n", " for code, form in members.items():\n", " rows.append({\"item\": item, \"form\": form, \"share\": 100 * float(by_code.get(code, 0.0)) / total,\n", " \"mt\": float(by_code.get(code, 0.0)) / 1e6, \"primary\": code == next(iter(members))})\n", " return pd.DataFrame(rows)\n", "\n", "\n", "for group, items in ITEM_GROUPS.items():\n", " frame = split_frame(items)\n", " # Order items by the primary code's share, so the split reads as a gradient.\n", " order = frame[frame[\"primary\"]].sort_values(\"share\", ascending=False)[\"item\"].tolist()\n", " fig = px.bar(\n", " frame, x=\"share\", y=\"item\", color=\"form\", orientation=\"h\",\n", " category_orders={\"item\": order},\n", " hover_data={\"mt\": \":.2f\", \"share\": \":.1f\", \"primary\": False},\n", " labels={\"share\": \"Share of the commodity's traded tonnes (%)\", \"item\": \"\", \"form\": \"FAO form\"},\n", " title=f\"{group}: how trade splits across FAO codes, {YEAR}\",\n", " )\n", " fig.update_layout(height=200 + 55 * len(items), width=850, barmode=\"stack\", legend_title_text=\"FAO form\")\n", " fig.show()" ] }, { "cell_type": "markdown", "id": "meatcut-takeaway", "metadata": {}, "source": [ "Reading across the charts: the primary FAO code, the one a \"single code per commodity\" rule would pick, is the minority of trade for most of these items, and for beef, almonds, hazelnuts, groundnuts and brazil nuts it is well under half. Rice is the starkest: the code we previously curated held almost none of the traded volume, which sat in milled (and broken) rice instead. Summing each commodity's codes is what makes the visualization reflect the trade people actually mean, while dropping the (non-comparable, double-counting) production and supply for these items keeps the rest of the figures honest." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "etl", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }