{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(joins)=\n",
    "# Joins\n",
    "\n",
    "## Introduction\n",
    "\n",
    "It's rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must *join* them together to answer the questions that you're interested in.\n",
    "\n",
    "**pandas** has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib_inline.backend_inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17575f3a",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "This chapter will use the **pandas** data analysis package."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80710c6e",
   "metadata": {},
   "source": [
    "## Concatenate\n",
    "\n",
    "If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using `pd.concat()`.\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/08_concat_row.svg)\n",
    "\n",
    "For the same columns, pass `axis=0` to glue the index together; for the same index, pass `axis=1` to glue the columns together. The concatenate function will typically be used on a list of data frames.\n",
    "\n",
    "If you want to track where the original data came from in the final data frame, use the `keys` keyword.\n",
    "\n",
    "Here's an example using data on two different states' populations that also makes uses of the `keys` option:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5ef4f37",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "base_url = \"http://www.stata-press.com/data/r14/\"\n",
    "state_codes = [\"ca\", \"il\"]\n",
    "end_url = \"pop.dta\"\n",
    "\n",
    "# This grabs the two data frames, one for each state\n",
    "list_of_state_dfs = [pd.read_stata(base_url + state + end_url) for state in state_codes]\n",
    "# Show example of first entry in list of data frames\n",
    "print(list_of_state_dfs[0])\n",
    "\n",
    "# Concatenate the list of data frames\n",
    "df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0c343b0",
   "metadata": {},
   "source": [
    "Note that the `keys` argument is optional, but is useful for keeping track of origin data frames within the merged data frame.\n",
    "\n",
    "````{admonition} Exercise\n",
    "Concatenate the follow two data frames:\n",
    "\n",
    "```python\n",
    "df1 = pd.DataFrame([['a', 1], ['b', 2]],\n",
    "                   columns=['letter', 'number'])\n",
    "\n",
    "df2 = pd.DataFrame([['c', 3], ['d', 4]],\n",
    "                   columns=['letter', 'number'])\n",
    "```\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6e91777",
   "metadata": {},
   "source": [
    "## Merge\n",
    "\n",
    "There are so many options for merging data frames using `pd.merge(left, right, on=..., how=...` that we won't be able to cover them all here. The most important features are: the two data frames to be merged, what variables (aka keys) to merge on (and these can be indexes) via `on=`, and *how* to do the merge (eg left, right, outer, inner) via `how=`. This diagram shows an example of a merge using keys from the left-hand data frame:\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/08_merge_left.svg)\n",
    "\n",
    "The `how=` keyword works in the following ways:\n",
    "- `how='left'` uses keys from the left data frame only to merge.\n",
    "- `how='right'` uses keys from the right data frame only to merge.\n",
    "- `how='inner'` uses keys that appear in both data frames to merge.\n",
    "- `how='outer'` uses the cartesian product of keys in both data frames to merge on.\n",
    "\n",
    "Let's see examples of some of these:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53c66d5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "left = pd.DataFrame(\n",
    "    {\n",
    "        \"key1\": [\"K0\", \"K0\", \"K1\", \"K2\"],\n",
    "        \"key2\": [\"K0\", \"K1\", \"K0\", \"K1\"],\n",
    "        \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
    "        \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
    "    }\n",
    ")\n",
    "right = pd.DataFrame(\n",
    "    {\n",
    "        \"key1\": [\"K0\", \"K1\", \"K1\", \"K2\"],\n",
    "        \"key2\": [\"K0\", \"K0\", \"K0\", \"K0\"],\n",
    "        \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
    "        \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
    "    }\n",
    ")\n",
    "# Right merge\n",
    "pd.merge(left, right, on=[\"key1\", \"key2\"], how=\"right\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af0241e",
   "metadata": {},
   "source": [
    "Note that the key combination of K2 and K0 did not exist in the left-hand data frame, and so its entries in the final data frame are NaNs. But it *does* have entries because we chose the keys from the right-hand data frame.\n",
    "\n",
    "What about an inner merge?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e73608f",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.merge(left, right, on=[\"key1\", \"key2\"], how=\"inner\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67ac74c0",
   "metadata": {},
   "source": [
    "Now we see that the combination K2 and K0 are excluded because they didn't exist in the overlap of keys in both data frames.\n",
    "\n",
    "Finally, let's take a look at an outer merge that comes with some extra info via the `indicator` keyword:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d209fb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.merge(left, right, on=[\"key1\", \"key2\"], how=\"outer\", indicator=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67ca901b",
   "metadata": {},
   "source": [
    "Now we can see that the products of all key combinations are here. The `indicator=True` option has caused an extra column to be added, called '_merge', that tells us which data frame the keys on that row came from.\n",
    "\n",
    "````{admonition} Exercise\n",
    "\n",
    "Merge the following two data frames using the `left_on` and `right_on` keyword arguments to specify a join on `lkey` and `rkey` respectively:\n",
    "\n",
    "```python\n",
    "df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n",
    "                    'value': [1, 2, 3, 5]})\n",
    "df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n",
    "                    'value': [5, 6, 7, 8]})\n",
    "```\n",
    "````\n",
    "\n",
    "````{admonition} Exercise\n",
    "\n",
    "Merge the following two data frames on `\"a\"` using `how=\"left\"` as a keyword argument:\n",
    "\n",
    "```python\n",
    "df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n",
    "df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})\n",
    "```\n",
    "\n",
    "What do you notice about the position `.loc[1, \"c\"]` in the merged data frame? \n",
    "````\n",
    "\n",
    "For more on the options for merging, see **pandas**' comprehensive [merging documentation](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging)."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
  },
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}