{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting data about a domain using the IA CDX API\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "In this notebook we'll look at how we can get domain level data from a CDX API. There are two types of search you can use:\n", "\n", "* a 'prefix' query – searching for `nla.gov.au/*` returns captures from the `nla.gov.au` domain\n", "* a 'domain' query – searching for `*.nla.gov.au` returns captures from the `nla.gov.au` domain *and any subdomains*\n", "\n", "These searches can be combined with any of the other filters supported by the CDX API, such as `mimetype` and `statuscode`.\n", "\n", "As noted in [Comparing CDX APIs](comparing_cdx_apis.ipynb), support for domain level searching varies across systems. The AWA allows prefix queries, but not domain queries. The UKWA provides both in theory, but timeouts are common for large domains. Neither the AWA or UKWA supports pagination, so harvesting data from large domains can cause difficulties. For these reasons it seems sensible to focus on the IA CDX API, unless you're after data from a single, modestly-sized domain.\n", "\n", "Related notebooks:\n", "\n", "* [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb)\n", "* [Comparing CDX APIs](comparing_cdx_apis.ipynb)\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb) – shows how to use an 'exact' search with the CDX API\n", "* [Find and explore Powerpoint presentations from a specific domain](explore_presentations.ipynb) – example of finding particular types of files within a domain\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In most other notebooks using the CDX API we've harvested data into memory and then saved to disk later on. Because we're potentially harvesting much larger quantities of data, it's probably a good idea to reverse this and save harvested data to disk as we download it. We can also use `requests-cache` to save responses from the API and make it easy to restart a failed harvest. This is the same strategy used in the [Exploring subdomains in the gov.au domain](harvesting_gov_au_domains.ipynb) notebook where I harvest data about 189 million captures.\n", "\n", "### Usage\n", "\n", "#### Prefix query\n", "\n", "Either using a url wildcard:\n", "\n", "```\n", "harvest_cdx_query_to_file('[domain]/*', [optional parameters])\n", "```\n", "\n", "or the `matchType` parameter:\n", "\n", "```\n", "harvest_cdx_query_to_file('[domain]', matchType='prefix', [optional parameters])\n", "```\n", "\n", "#### Domain query\n", "\n", "Either using a url wildcard:\n", "\n", "```\n", "harvest_cdx_query_to_file('*.[domain]', [optional parameters])\n", "```\n", "\n", "or the `matchType` parameter:\n", "\n", "```\n", "harvest_cdx_query_to_file('[domain]', matchType='domain', [optional parameters])\n", "```\n", "\n", "### Output\n", "\n", "The results of each harvest are stored in a timestamped `.ndjson` file in a subdirectory of the `domains` directory. For example, a harvest from `nla.gov.au` is stored in `domains/nla-gov-au`. The file names combine the domain, the type of query (either 'prefix' or 'domain') and a timestamp. For example, a prefix query in `nla.gov.au` might generate a file named:\n", "\n", "```\n", "nla-gov-au-prefix-20200526113338.ndjson\n", "```\n", "\n", "Each harvest also creates a metadata file that has a similar name, but is in JSON format, for example:\n", "\n", "```\n", "nla-gov-au-prefix-20200526113338-metadata.json\n", "```\n", "\n", "The metadata file captures information about your harvest including:\n", "\n", "* `params` – the parameters used in your query (including any filters)\n", "* `timestamp` – date and time the harvest was started\n", "* `file` – path to the `ndjson` data file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from pathlib import Path\n", "\n", "import arrow\n", "import ndjson\n", "import pandas as pd\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from requests_cache import CachedSession\n", "from slugify import slugify\n", "from tqdm.auto import tqdm\n", "\n", "# By using a cached session, all responses will be saved in a local cache\n", "s = CachedSession()\n", "retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_total_pages(params):\n", " \"\"\"\n", " Gets the total number of pages in a set of results.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"showNumPages\"] = \"true\"\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=these_params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " return int(response.text)\n", "\n", "\n", "def prepare_params(url, **kwargs):\n", " \"\"\"\n", " Prepare the parameters for a CDX API requests.\n", " Adds all supplied keyword arguments as parameters (changing from_ to from).\n", " Adds in a few necessary parameters.\n", " \"\"\"\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " params[\"pageSize\"] = 5\n", " # CDX accepts a 'from' parameter, but this is a reserved word in Python\n", " # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n", " if \"from_\" in params:\n", " params[\"from\"] = params[\"from_\"]\n", " del params[\"from_\"]\n", " return params\n", "\n", "\n", "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def check_query_type(url):\n", " if url.startswith(\"*\"):\n", " query_type = \"domain\"\n", " elif url.endswith(\"*\"):\n", " query_type = \"prefix\"\n", " else:\n", " query_type = \"\"\n", " return query_type\n", "\n", "\n", "def get_cdx_data(params):\n", " \"\"\"\n", " Make a request to the CDX API using the supplied parameters.\n", " Return results converted to a list of dicts.\n", " \"\"\"\n", " response = s.get(\"http://web.archive.org/cdx/search/cdx\", params=params)\n", " response.raise_for_status()\n", " results = response.json()\n", " try:\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " except AttributeError:\n", " # Not using cache\n", " time.sleep(0.2)\n", " return convert_lists_to_dicts(results)\n", "\n", "\n", "def save_metadata(output_dir, params, query_type, timestamp, file_path):\n", " md_path = Path(\n", " output_dir, f'{slugify(params[\"url\"])}-{query_type}-{timestamp}-metadata.json'\n", " )\n", " md = {\"params\": params, \"timestamp\": timestamp, \"file\": str(file_path)}\n", " with md_path.open(\"wt\") as md_json:\n", " json.dump(md, md_json)\n", "\n", "\n", "def harvest_cdx_query_to_file(url, **kwargs):\n", " \"\"\"\n", " Harvest capture data from a CDX query.\n", " Save results to a NDJSON formatted file.\n", " \"\"\"\n", " params = prepare_params(url, **kwargs)\n", " total_pages = get_total_pages(params)\n", " output_dir = Path(\"domains\", slugify(url))\n", " output_dir.mkdir(parents=True, exist_ok=True)\n", " # We'll use a timestamp to distinguish between versions\n", " timestamp = arrow.now().format(\"YYYYMMDDHHmmss\")\n", " query_type = params[\"matchType\"] if \"matchType\" in params else check_query_type(url)\n", " file_path = Path(output_dir, f\"{slugify(url)}-{query_type}-{timestamp}.ndjson\")\n", " save_metadata(output_dir, params, query_type, timestamp, file_path)\n", " page = 0\n", " with tqdm(total=total_pages - page) as pbar1:\n", " with tqdm() as pbar2:\n", " while page < total_pages:\n", " params[\"page\"] = page\n", " results = get_cdx_data(params)\n", " with file_path.open(\"a\") as f:\n", " writer = ndjson.writer(f, ensure_ascii=False)\n", " for result in results:\n", " writer.writerow(result)\n", " page += 1\n", " pbar1.update(1)\n", " pbar2.update(len(results) - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prefix query\n", "\n", "For a 'prefix' query either set the `matchType` parameter to `prefix` or use a url wildcard like `nla.gov.au/*`.\n", "\n", "Get all successful web page captures from the `nla.gov.au` domain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "harvest_cdx_query_to_file(\n", " \"discontents.com.au/*\", filter=[\"statuscode:200\", \"mimetype:text/html\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `collapse` to limit the harvest to remove (most) records with duplicate values for `urlkey`. This should give us a list of unique urls from the `nla.gov.au` domain." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "harvest_cdx_query_to_file(\n", " \"nla.gov.au/*\", filter=[\"statuscode:200\", \"mimetype:text/html\"], collapse=\"urlkey\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Domain query\n", "\n", "For a 'domain' query either set the `matchType` parameter to `domain` or use a url wildcard like `*.nla.gov.au`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "harvest_cdx_query_to_file(\n", " \"*.discontents.com.au\",\n", " filter=[\"statuscode:200\", \"mimetype:text/html\"],\n", " collapse=\"urlkey\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "harvest_cdx_query_to_file(\n", " \"*.nla.gov.au\", filter=[\"statuscode:200\", \"mimetype:text/html\"], collapse=\"urlkey\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should be able to load smaller files using the `ndjson` module. If you're working with large data files (millions of captures) you might not want to load them all into memory. Have a look at [Exploring subdomains in the gov.au domain](harvesting_gov_au_domains.ipynb) for some ways of processing the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Edit to point to your data_file, eg: 'domains/nla-gov-au/nla-gov-au-prefix-20200526123711.ndjson'\n", "data_file = \"[Path to data file]\"\n", "data_file = \"domains/nla-gov-au/nla-gov-au-prefix-20200526123711.ndjson\"\n", "with open(data_file) as f:\n", " capture_data = ndjson.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could then convert the capture data to a Pandas dataframe for analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df = pd.DataFrame(capture_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": { "0794ca95814a43c9a22560df196831cc": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "0bf61accd93e45f7be8553cfb43a2489": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_7c0e3d4996ea4bfe9e866e4f3ab1db51", "style": "IPY_MODEL_7c94875645d846b0a8546763924d2dda", "value": " 46/46 [02:41<00:00, 4.53s/it]" } }, "1093e47402c24a46bdef6d3d550c181c": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "bar_style": "success", "layout": "IPY_MODEL_f4f305f52bd746ae9de5c5b71c4d46f8", "max": 46, "style": "IPY_MODEL_bc347301928140c3ad0b1b8b634268b2", "value": 46 } }, "14b5d5171fb84460b73a2e6f5e1765f1": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_a1c206dcba7945d18f47957a6fc6d245", "style": "IPY_MODEL_c3b2e8aad12d4b38829c717df7089957", "value": "100%" } }, "156d0e6b727e4a6abdcbafe252331d59": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "1efd6827a70d47b6b39930508f5f982a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "2898b6d4428d46d7a90f6089929a61c6": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "5ba8ae955021496d8e71e74a1559d571": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_ef53fd8aafb34edbb4df1d81d980710b", "IPY_MODEL_ee6c11e64a7c4c2aa6d20779b84143fe", "IPY_MODEL_affca5ea908742d288bda4c90c6670cd" ], "layout": "IPY_MODEL_be4c0ede9a584dd9a984a3b9fc1c2eeb" } }, "7c0e3d4996ea4bfe9e866e4f3ab1db51": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "7c94875645d846b0a8546763924d2dda": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "a1c206dcba7945d18f47957a6fc6d245": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "affca5ea908742d288bda4c90c6670cd": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_1efd6827a70d47b6b39930508f5f982a", "style": "IPY_MODEL_0794ca95814a43c9a22560df196831cc", "value": " 610985/? [02:41<00:00, 2303.31it/s]" } }, "b79ba86fd0364b38b06fdd9d07acbf05": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "description_width": "" } }, "bc347301928140c3ad0b1b8b634268b2": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "description_width": "" } }, "be4c0ede9a584dd9a984a3b9fc1c2eeb": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "c3b2e8aad12d4b38829c717df7089957": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "dac59d08bfe2439890b1ddf0535fe30f": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_14b5d5171fb84460b73a2e6f5e1765f1", "IPY_MODEL_1093e47402c24a46bdef6d3d550c181c", "IPY_MODEL_0bf61accd93e45f7be8553cfb43a2489" ], "layout": "IPY_MODEL_e2b5964dd51e4e268bc16ca1fc9ae6fb" } }, "e2b5964dd51e4e268bc16ca1fc9ae6fb": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "ee6c11e64a7c4c2aa6d20779b84143fe": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "bar_style": "success", "layout": "IPY_MODEL_ff9ae4eca4e841d2b4e5702dabbee25d", "max": 1, "style": "IPY_MODEL_b79ba86fd0364b38b06fdd9d07acbf05", "value": 1 } }, "ef53fd8aafb34edbb4df1d81d980710b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_156d0e6b727e4a6abdcbafe252331d59", "style": "IPY_MODEL_2898b6d4428d46d7a90f6089929a61c6" } }, "f4f305f52bd746ae9de5c5b71c4d46f8": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "ff9ae4eca4e841d2b4e5702dabbee25d": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "width": "20px" } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }