{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting all available snapshots of a particular page from the Internet Archive – Timemap or CDX?\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "There are a couple of ways of getting a list of the available snapshots for a particular url. In this notebook, we'll compare the Internet Archive's CDX index API, with their Memento Timemap API. Do they give us the same data?\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) for more information about the CDX API." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data for comparison" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_timemap(url):\n", " \"\"\"\n", " Get a Timemap in JSON format for the specified url.\n", " \"\"\"\n", " response = requests.get(\n", " f\"https://web.archive.org/web/timemap/json/{url}\", headers={\"User-Agent\": \"\"}\n", " )\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " \"\"\"\n", " Query the IA CDX API for the supplied url.\n", " You can optionally provide any of the parameters accepted by the API.\n", " \"\"\"\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " # User-Agent value is necessary or else IA gives an error\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = \"http://nla.gov.au\"\n", "tm_data = query_timemap(url)\n", "tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])\n", "cdx_data = query_cdx(url)\n", "cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are 
the columns the same?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(cdx_df.columns)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tm_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Timemap data includes three extra columns: `robotflags`, `offset`, and `filename`. The `offset` and `filename` columns tell you where to find the snapshot, but I'm not sure what `robotflags` is for (it's not in the [specification](http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/)). Let's gave a look at what sort of values it has." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "- 4404\n", "Name: robotflags, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df[\"robotflags\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's nothing in it – at least for this particular url.\n", "\n", "For my purposes, it doesn't look like the Timemap adds anything useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Do they provide the same number of snapshots?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4404, 11)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4405, 7)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdx_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there are more snapshots in the CDX results than the Timemap. Can we find out what they are?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlengthredirectrobotflagsoffsetfilename
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [urlkey, timestamp, original, mimetype, statuscode, digest, length, redirect, robotflags, offset, filename]\n", "Index: []" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combine the two dataframes, then only keep rows that aren't duplicated based on timestamp, original, digest, and statuscode\n", "pd.concat([cdx_df, tm_df]).drop_duplicates(\n", " subset=[\"timestamp\", \"original\", \"digest\", \"statuscode\"], keep=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, if there were rows in the `cdx_df` that weren't in the `tm_df` I'd expect them to show up, but there are **no** rows that aren't duplicated based on the `timestamp`, `original`, `digest`, and `statuscode` columns...\n", "\n", "Let's try this another way, by finding the number of unique shapshots in each df." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4304, 7)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove duplicate rows\n", "cdx_df.drop_duplicates(\n", " subset=[\"timestamp\", \"digest\", \"statuscode\", \"original\"], keep=\"first\"\n", ").shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4304, 11)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove duplicate rows\n", "tm_df.drop_duplicates(\n", " subset=[\"timestamp\", \"digest\", \"statuscode\", \"original\"], keep=\"first\"\n", ").shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah, so both sets of data contain duplicates, and there are really only **4,304 unique shapshots**. Let's look at some of the duplicates in the CDX data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlength
878au,gov,nla)/20090327043759http://www.nla.gov.au/text/html200537C3S5FANRHGLW3A6WPE6A57LULWNOF6306
879au,gov,nla)/20090327043759http://www.nla.gov.au/text/html200537C3S5FANRHGLW3A6WPE6A57LULWNOF6473
880au,gov,nla)/20090515004007http://www.nla.gov.au/text/html200CC747V3CYGCYQZELL37KNOW5DRPEMFEW6614
881au,gov,nla)/20090515004007http://www.nla.gov.au/text/html200CC747V3CYGCYQZELL37KNOW5DRPEMFEW6614
883au,gov,nla)/20090521102300http://www.nla.gov.au/text/html20025VWCDZDMMC57PLHGKIJ6XUBG566EW336619
884au,gov,nla)/20090521102300http://www.nla.gov.au/text/html20025VWCDZDMMC57PLHGKIJ6XUBG566EW336619
885au,gov,nla)/20090521230410http://nla.gov.au/warc/revisit-BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD469
886au,gov,nla)/20090521230410http://nla.gov.au/warc/revisit-BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD469
887au,gov,nla)/20090528133919http://www.nla.gov.au/text/html200IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM6755
888au,gov,nla)/20090528133919http://www.nla.gov.au/text/html200IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM6755
\n", "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "878 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html \n", "879 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html \n", "880 au,gov,nla)/ 20090515004007 http://www.nla.gov.au/ text/html \n", "881 au,gov,nla)/ 20090515004007 http://www.nla.gov.au/ text/html \n", "883 au,gov,nla)/ 20090521102300 http://www.nla.gov.au/ text/html \n", "884 au,gov,nla)/ 20090521102300 http://www.nla.gov.au/ text/html \n", "885 au,gov,nla)/ 20090521230410 http://nla.gov.au/ warc/revisit \n", "886 au,gov,nla)/ 20090521230410 http://nla.gov.au/ warc/revisit \n", "887 au,gov,nla)/ 20090528133919 http://www.nla.gov.au/ text/html \n", "888 au,gov,nla)/ 20090528133919 http://www.nla.gov.au/ text/html \n", "\n", " statuscode digest length \n", "878 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6306 \n", "879 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6473 \n", "880 200 CC747V3CYGCYQZELL37KNOW5DRPEMFEW 6614 \n", "881 200 CC747V3CYGCYQZELL37KNOW5DRPEMFEW 6614 \n", "883 200 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 6619 \n", "884 200 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 6619 \n", "885 - BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD 469 \n", "886 - BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD 469 \n", "887 200 IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM 6755 \n", "888 200 IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM 6755 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dupes = cdx_df.loc[\n", " cdx_df.duplicated(subset=[\"timestamp\", \"digest\"], keep=False)\n", "].sort_values(by=\"timestamp\")\n", "dupes.head(10)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Date range of duplicates: 20090327043759 to 20210302172503\n" ] } ], "source": [ "print(\n", " f'Date range of duplicates: {dupes[\"timestamp\"].min()} to {dupes[\"timestamp\"].max()}'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it seems they provide the same number of 
unique snapshots, but the CDX index adds a few more duplicates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Is there a difference in speed?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.99 s ± 1.39 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "tm_data = query_timemap(url)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.34 s ± 261 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "cdx_data = query_cdx(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both methods provide much the same data, so choosing between them comes down to convenience and performance – in this test, the CDX API was roughly three to four times faster than the Timemap." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." 
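Putting the findings above together, a de-duplication step is worth applying to whichever API you use. This is a sketch only – `dedupe_snapshots` is a name I've made up, and the rows below are hypothetical stand-ins for a real CDX response:

```python
import pandas as pd


def dedupe_snapshots(df):
    """Drop the duplicate captures both APIs can return, keeping the
    first row for each unique capture (hypothetical helper)."""
    return df.drop_duplicates(
        subset=["timestamp", "original", "digest", "statuscode"], keep="first"
    )


# Hypothetical rows standing in for real CDX results
df = pd.DataFrame(
    [
        ["20090327043759", "http://www.nla.gov.au/", "537C3S", "200"],
        ["20090327043759", "http://www.nla.gov.au/", "537C3S", "200"],
        ["20090515004007", "http://www.nla.gov.au/", "CC747V", "200"],
    ],
    columns=["timestamp", "original", "digest", "statuscode"],
)
print(dedupe_snapshots(df).shape)  # (2, 4) – two unique captures remain
```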
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }