{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest GLAM datasets from government data portals\n", "\n", "Australian GLAM organisations have made a large number of openly-licensed datasets available through government data portals. But they're not always easy to find. Some are in state-based portals, others are in the national portal. And who would go looking for library data in a government data portal anyway?\n", "\n", "To encourage people to explore these datasets, I've harvested them all from the different portals and combined them into one big CSV file.\n", "\n", "## Method\n", "\n", "I've harvested data from the following portals:\n", "\n", "* [data.gov.au](https://data.gov.au/)\n", "* [data.nsw.gov.au](https://data.nsw.gov.au/)\n", "* [data.vic.gov.au](https://www.data.vic.gov.au/)\n", "* [data.sa.gov.au](https://data.sa.gov.au/)\n", "* [data.wa.gov.au](https://data.wa.gov.au/)\n", "* [data.qld.gov.au](https://www.data.qld.gov.au/)\n", "\n", "In actual fact [data.gov.au](https://data.gov.au/) provides two portals – an old one that includes datasets not in the state portals, and a new one that brings all the state and national datasets together. So why didn't I just harvest everything from the new data.gov.au portal? [I did](harvest_glam_datasets_from_datagovau.ipynb), but it soon became apparent that the new portal had a problem with managing duplicate organisations and datasets that made the results difficult to use. So now I've gone back to aggregating everything myself.\n", "\n", "For each portal, I've used the web interface to manually search for terms like 'library', 'archives', 'records', and 'museum' to find GLAM organisations. This isn't always straightforward. Sometimes the GLAM organisation will be identified as an 'organisation' by the data portal. But other times, the GLAM organisation is hidden beneath a parent organisation, and relevant datasets are identified by tags that include the GLAM organisation's name. In some cases there are neither organisations nor tags, and you just have to search for datasets that include the organisation name somewhere in their notes. Because of these inconsistencies, it's entirely possible that I've missed some organisations.\n", "\n", "I've saved all of the organisation names, tags, and queries into the `portals` list you'll see below, along with each portal's API endpoint. Fortunately all of the portals use CKAN behind the scenes, so the API is consistent. Yay! This makes things so much easier. Unfortunately Victoria makes you register and get an API key before you can access their CKAN API, so if you want to run this harvest yourself, you'll have to insert your own API key where indicated. \n", "\n", "The datasets themselves are arranged in a hierarchy of packages and resources. A package can contain multiple resources, or files. These might be the same data in different formats, data files and documentation, or versions of the data that change over time. I flatten out this hierarchy as I harvest the packages to create a CSV file where each row is a single file.
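\n",
"\n",
"For example, a package documenting a library collection might contain both a CSV data file and a PDF data dictionary, and each of those files gets its own row in the output. Here's a minimal sketch of the basic pattern (illustrative only, using plain `requests` and hypothetical variable names; the real harvesting code is in the cells below):\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"# Every portal runs CKAN, so the same two API calls work on all of them:\n",
"# package_search lists packages, package_show returns a package with its resources (files).\n",
"api = \"https://data.qld.gov.au/api/action/\"\n",
"found = requests.get(api + \"package_search\", params={\"fq\": \"organization:state-library-queensland\"}).json()\n",
"\n",
"rows = []\n",
"for package in found[\"result\"][\"results\"]:\n",
"    details = requests.get(api + \"package_show\", params={\"id\": package[\"id\"]}).json()[\"result\"]\n",
"    for resource in details[\"resources\"]:\n",
"        # one row in the output CSV per file\n",
"        rows.append(\n",
"            {\"dataset_title\": details[\"title\"], \"file_title\": resource[\"name\"], \"download_url\": resource[\"url\"]}\n",
"        )\n",
"```\n",
"\n",
"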
The fields I'm capturing are:\n", "\n", "* `dataset_title` – name of the package\n", "* `publisher` – organisation that created/published the package\n", "* `author` – usually an email of the person who uploaded the package\n", "* `dataset_issued` – date the package was created\n", "* `dataset_modified` – date the package was last changed\n", "* `dataset_description` – a description of the package\n", "* `source` – the portal it was harvested from\n", "* `info_url` – a link to the portal page for more information\n", "* `start_date` – earliest date in the data\n", "* `end_date` – latest date in the data\n", "* `file_title` – name of the file (resource)\n", "* `download_url` – url to directly download the data file\n", "* `format` – format of the file, eg. 'CSV' or 'JSON'\n", "* `file_description` – description of the file\n", "* `file_created` – date the file was created\n", "* `file_modified` – date the file was last changed\n", "* `file_size` – size of the file\n", "* `licence` – licence string, eg. 'CC-BY'\n", "\n", "You can browse a list of datasets, [download a CSV](https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals.csv) containing all the harvested data, or [just the CSVs](https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals-csvs.csv). You can also [search the harvested data](https://ozglam-datasets.glitch.me/data/glam-datasets-from-gov-portals) using Datasette on Glitch.\n", "\n", "To start exploring the *contents* of the datasets, give the [GLAM CSV Explorer](https://glam-workbench.github.io/csv-explorer/) a spin." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import time\n", "from json import JSONDecodeError\n", "\n", "import pandas as pd\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from requests_cache import CachedSession\n", "from slugify import slugify\n", "from tqdm.notebook import tqdm\n", "\n", "s = CachedSession()\n", "s.headers.update(\n", " {\n", " \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0\"\n", " }\n", ")\n", "retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "portals = [\n", " {\n", " \"name\": \"data.qld.gov.au\",\n", " \"api_url\": \"https://data.qld.gov.au/api/action/\",\n", " \"orgs\": [\"state-library-queensland\"],\n", " \"tags\": [\"Queensland State Archives\", \"queensland state archives\"],\n", " \"queries\": [\"Queensland Museum\"],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.qld.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.gov.au\",\n", " \"api_url\": \"https://data.gov.au/api/3/action/\",\n", " \"orgs\": [\n", " \"aiatsis\",\n", " \"nationallibraryofaustralia\",\n", " \"libraries-tasmania\",\n", " \"nationalarchivesofaustralia\",\n", " \"national-portrait-gallery\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.sa.gov.au\",\n", " \"api_url\": \"https://data.sa.gov.au/data/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-south-australia\",\n", " 
\"mount-gambier-library\",\n", " \"state-records\",\n", " \"history-sa\",\n", " \"south-australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.sa.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.nsw.gov.au\",\n", " \"api_url\": \"https://data.nsw.gov.au/data/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-nsw\",\n", " \"nsw-state-archives\",\n", " \"maas\",\n", " \"australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.nsw.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.wa.gov.au\",\n", " \"api_url\": \"https://catalogue.data.wa.gov.au/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-western-australia\",\n", " \"state-records-office-of-western-australia\",\n", " \"western-australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://catalogue.data.wa.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.vic.gov.au\",\n", " \"api_url\": \"https://discover.data.vic.gov.au/api/3/action/\",\n", " # 'apikey': 'YOUR API KEY',\n", " \"orgs\": [\"state-library-of-victoria\"],\n", " \"tags\": [],\n", " \"queries\": [\"PROV\", \"Public Records Office\", \"Museums Victoria\"],\n", " \"groups\": [],\n", " \"base_url\": \"https://www.data.vic.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", "]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def get_value(field):\n", " \"\"\"\n", " Sometimes values are strings and sometimes objects in strings.\n", " Get string values.\n", " \"\"\"\n", " try:\n", " s = field.replace(\"u'\", \"'\").replace(\"'\", '\"')\n", " j = json.loads(s)\n", " value = j[\"name\"]\n", " except JSONDecodeError:\n", " value = field\n", " except AttributeError:\n", " value = None\n", " return value\n", "\n", "\n", "def fix_github_links(url):\n", " \"\"\"\n", " Make sure github links point to downloadable files.\n", " \"\"\"\n", " return url.replace(\"//github.com\", \"//raw.githubusercontent.com\").replace(\n", " \"/blob\", \"\"\n", " )\n", "\n", "\n", "def check_http_status(url):\n", " \"\"\"\n", " Do a HEAD request of downloadable datasets to check if they're still there.\n", " \"\"\"\n", " response = s.head(url, allow_redirects=True)\n", " return response.status_code\n", "\n", "\n", "def get_format(resource):\n", " # First try getting file extension\n", " try:\n", " url = fix_github_links(resource[\"url\"])\n", " file_format = re.search(r\"\\.([a-zA-Z]+)$\", url).group(1).upper()\n", " # If that fails just use the supplied value (which may be dodgy)\n", " except AttributeError:\n", " file_format = resource[\"format\"]\n", " return file_format\n", "\n", "\n", "def add_key(portal):\n", " \"\"\"Add an API KEY into headers.\"\"\"\n", " if \"apikey\" in portal:\n", " headers = {\n", " \"apikey\": portal[\"apikey\"],\n", " \"Content-Type\": \"application/json\",\n", " \"Accept\": \"application/json\",\n", " }\n", " else:\n", " headers = {}\n", " return headers\n", "\n", "\n", "def get_package_resources(package_id, portal, org=None):\n", " \"\"\"\n", " Given a package id and a portal, download details of all associated datasets/\n", " \"\"\"\n", " resources = []\n", " api_url = portal[\"api_url\"]\n", " url = \"{}package_show?id={}\".format(api_url, package_id)\n", " # print(url)\n", 
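" # Fetch the package's full metadata from the CKAN package_show endpoint (add_key() supplies an API key header for portals that need one)\n",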
" response = s.get(url, headers=add_key(portal))\n", " package_data = response.json()\n", " try:\n", " title = package_data[\"result\"][\"title\"]\n", " except KeyError:\n", " # Not found\n", " pass\n", " else:\n", " if org:\n", " organisation = org\n", " else:\n", " organisation = package_data[\"result\"][\"organization\"][\"title\"]\n", " try:\n", " author = get_value(package_data[\"result\"][\"author\"])\n", " except KeyError:\n", " author = None\n", " try:\n", " date_from = package_data[\"result\"][\"temporal_coverage_from\"]\n", " except KeyError:\n", " date_from = \"\"\n", " try:\n", " date_to = package_data[\"result\"][\"temporal_coverage_to\"]\n", " except KeyError:\n", " date_to = \"\"\n", " for resource in package_data[\"result\"][\"resources\"]:\n", " dataset = {}\n", " resource_url = fix_github_links(resource[\"url\"])\n", " dataset[\"dataset_title\"] = title.strip()\n", " dataset[\"publisher\"] = organisation\n", " if author:\n", " dataset[\"author\"] = author\n", " dataset[\"dataset_issued\"] = package_data[\"result\"][\"metadata_created\"]\n", " dataset[\"dataset_modified\"] = package_data[\"result\"][\"metadata_modified\"]\n", " dataset[\"dataset_description\"] = package_data[\"result\"][\"notes\"]\n", " dataset[\"source\"] = portal[\"name\"]\n", " dataset[\"info_url\"] = portal[\"base_url\"] + package_id\n", " dataset[\"start_date\"] = date_from\n", " dataset[\"end_date\"] = date_to\n", " dataset[\"file_title\"] = resource[\"name\"].strip()\n", " dataset[\"download_url\"] = resource_url\n", " dataset[\"format\"] = get_format(resource)\n", " dataset[\"file_description\"] = resource.get(\"description\")\n", " dataset[\"file_created\"] = resource[\"created\"]\n", " try:\n", " dataset[\"file_modified\"] = resource[\"last_modified\"]\n", " except KeyError:\n", " pass\n", " dataset[\"file_size\"] = resource[\"size\"]\n", " # dataset['status'] = check_http_status(resource_url)\n", " dataset[\"licence\"] = package_data[\"result\"][\"license_title\"]\n", " resources.append(dataset)\n", " return resources\n", "\n", "\n", "def process_packages(url, portal, results_label, org=None):\n", " \"\"\"\n", " Get list of packages associated with an organisation, or returned by a search,\n", " then get details of all the files (resources) inside that package.\n", " \"\"\"\n", " tqdm.write(url)\n", " resources = []\n", " response = s.get(url, headers=add_key(portal))\n", " data = response.json()\n", " for package in data[\"result\"][results_label]:\n", " resources.extend(get_package_resources(package[\"id\"], portal, org=org))\n", " time.sleep(0.2)\n", " return resources\n", "\n", "\n", "def process_portals():\n", " \"\"\"\n", " Get all of the resources from the defined portals.\n", " \"\"\"\n", " resources = []\n", " for portal in tqdm(portals):\n", " api_url = portal[\"api_url\"]\n", " for org in portal[\"orgs\"]:\n", " # url = f'{api_url}organization_show?id={org}&include_datasets=true'\n", " url = f\"{api_url}package_search?fq=organization:{org}&rows=1000\"\n", " # resources.extend(process_packages(url, portal, 'packages'))\n", " resources.extend(process_packages(url, portal, \"results\"))\n", " for tag in portal[\"tags\"]:\n", " url = f'{api_url}package_search?q=tags:\"{tag}\"&rows=1000'\n", " resources.extend(process_packages(url, portal, \"results\", org=tag))\n", " for query in portal[\"queries\"]:\n", " url = f'{api_url}package_search?q=\"{query}\"&rows=1000'\n", " resources.extend(process_packages(url, portal, \"results\", org=query))\n", " for group in portal[\"groups\"]:\n", " url = 
f\"{api_url}group_show?id={group}&include_datasets=True\"\n", " resources.extend(process_packages(url, portal, \"packages\"))\n", " return resources" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9fec6b64e5b0411ba67b2dade6f5df4a", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/6 [00:00