{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest GLAM datasets from government data portals\n", "\n", "Australian GLAM organisations have made a large number of openly-licensed datasets available through government data portals. But they're not always easy to find. Some are in state-based portals, others are in the national portal. And who would go looking for library data in a government data portal anyway?\n", "\n", "To encourage people to explore these datasets, I've harvested them all from the different portals and combined them into one big CSV file.\n", "\n", "## Method\n", "\n", "I've harvested data from the following portals:\n", "\n", "* [data.gov.au](https://data.gov.au/)\n", "* [data.nsw.gov.au](https://data.nsw.gov.au/)\n", "* [data.vic.gov.au](https://www.data.vic.gov.au/)\n", "* [data.sa.gov.au](https://data.sa.gov.au/)\n", "* [data.wa.gov.au](https://data.wa.gov.au/)\n", "* [data.qld.gov.au](https://www.data.qld.gov.au/)\n", "\n", "In actual fact [data.gov.au](https://data.gov.au/) provides two portals – an old one that includes datasets not in the state portals, and a new one that brings all the state and national datasets together. So why didn't I just harvest everything from the new data.gov.au portal? [I did](harvest_glam_datasets_from_datagovau.ipynb), but it soon became apparent that the new portal had a problem with managing duplicate organisations and datasets that made the results difficult to use. So now I've gone back to aggregating everything myself.\n", "\n", "For each portal, I've used the web interface to manually search for terms like 'library', 'archives', 'records', and 'museum' to find GLAM organisations. This isn't always straightforward. Sometimes the GLAM organisation will be identified as an 'organisation' by the data portal. But other times, the GLAM organisation is hidden beneath a parent organisation, and relevant datasets are identified by tags that include the GLAM organisation's name. In some cases there are neither organisations nor tags, and you just have to search for datasets that include the organisation name somewhere in their notes. Because of these inconsistencies, it's entirely possible that I've missed some organisations.\n", "\n", "I've saved all of the organisation names, tags, and queries into the `portals` list you'll see below, along with each portal's API endpoint. Fortunately all of the portals use CKAN behind the scenes, so the API is consistent. Yay! This makes things so much easier. Unfortunately Victoria makes you register and get an API key before you can access their CKAN API, so if you want to run this harvest yourself, you'll have to insert your own API key where indicated. \n", "\n", "The datasets themselves are arranged in a hierarchy of packages and resources. A package can contain multiple resources, or files. These might be the same data in different formats, data files and documentation, or versions of the data that change over time. I flatten out this hierarchy as I harvest the packages to create a CSV file where each row is a single file.
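\n",
"\n",
"For example, a package documenting a library collection might contain both a CSV data file and a PDF data dictionary, and each of those files gets its own row in the output. Here's a minimal sketch of the basic pattern (illustrative only, using plain `requests` and hypothetical variable names; the real harvesting code is in the cells below):\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"# Every portal runs CKAN, so the same two API calls work on all of them:\n",
"# package_search lists packages, package_show returns a package with its resources (files).\n",
"api = \"https://data.qld.gov.au/api/action/\"\n",
"found = requests.get(api + \"package_search\", params={\"fq\": \"organization:state-library-queensland\"}).json()\n",
"\n",
"rows = []\n",
"for package in found[\"result\"][\"results\"]:\n",
"    details = requests.get(api + \"package_show\", params={\"id\": package[\"id\"]}).json()[\"result\"]\n",
"    for resource in details[\"resources\"]:\n",
"        # one row in the output CSV per file\n",
"        rows.append(\n",
"            {\"dataset_title\": details[\"title\"], \"file_title\": resource[\"name\"], \"download_url\": resource[\"url\"]}\n",
"        )\n",
"```\n",
"\n",
"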
The fields I'm capturing are:\n", "\n", "* `dataset_title` – name of the package\n", "* `publisher` – organisation that created/published the package\n", "* `author` – usually an email of the person who uploaded the package\n", "* `dataset_issued` – date the package was created\n", "* `dataset_modified` – date the package was last changed\n", "* `dataset_description` – a description of the package\n", "* `source` – the portal it was harvested from\n", "* `info_url` – a link to the portal page for more information\n", "* `start_date` – earliest date in the data\n", "* `end_date` – latest date in the data\n", "* `file_title` – name of the file (resource)\n", "* `download_url` – url to directly download the data file\n", "* `format` – format of the file, eg. 'CSV' or 'JSON'\n", "* `file_description` – description of the file\n", "* `file_created` – date the file was created\n", "* `file_modified` – date the file was last changed\n", "* `file_size` – size of the file\n", "* `licence` – licence string, eg. 'CC-BY'\n", "\n", "You can browse a list of datasets, [download a CSV](https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals.csv) containing all the harvested data, or [just the CSVs](https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals-csvs.csv). You can also [search the harvested data](https://ozglam-datasets.glitch.me/data/glam-datasets-from-gov-portals) using Datasette on Glitch.\n", "\n", "To start exploring the *contents* of the datasets, give the [GLAM CSV Explorer](https://glam-workbench.github.io/csv-explorer/) a spin." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import time\n", "from json import JSONDecodeError\n", "\n", "import pandas as pd\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from requests_cache import CachedSession\n", "from slugify import slugify\n", "from tqdm.notebook import tqdm\n", "\n", "s = CachedSession()\n", "s.headers.update(\n", " {\n", " \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0\"\n", " }\n", ")\n", "retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "portals = [\n", " {\n", " \"name\": \"data.qld.gov.au\",\n", " \"api_url\": \"https://data.qld.gov.au/api/action/\",\n", " \"orgs\": [\"state-library-queensland\"],\n", " \"tags\": [\"Queensland State Archives\", \"queensland state archives\"],\n", " \"queries\": [\"Queensland Museum\"],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.qld.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.gov.au\",\n", " \"api_url\": \"https://data.gov.au/api/3/action/\",\n", " \"orgs\": [\n", " \"aiatsis\",\n", " \"nationallibraryofaustralia\",\n", " \"libraries-tasmania\",\n", " \"nationalarchivesofaustralia\",\n", " \"national-portrait-gallery\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.sa.gov.au\",\n", " \"api_url\": \"https://data.sa.gov.au/data/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-south-australia\",\n", " 
\"mount-gambier-library\",\n", " \"state-records\",\n", " \"history-sa\",\n", " \"south-australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.sa.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.nsw.gov.au\",\n", " \"api_url\": \"https://data.nsw.gov.au/data/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-nsw\",\n", " \"nsw-state-archives\",\n", " \"maas\",\n", " \"australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://data.nsw.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.wa.gov.au\",\n", " \"api_url\": \"https://catalogue.data.wa.gov.au/api/3/action/\",\n", " \"orgs\": [\n", " \"state-library-of-western-australia\",\n", " \"state-records-office-of-western-australia\",\n", " \"western-australian-museum\",\n", " ],\n", " \"tags\": [],\n", " \"queries\": [],\n", " \"groups\": [],\n", " \"base_url\": \"https://catalogue.data.wa.gov.au/dataset/\",\n", " \"package_ids\": [],\n", " },\n", " {\n", " \"name\": \"data.vic.gov.au\",\n", " \"api_url\": \"https://discover.data.vic.gov.au/api/3/action/\",\n", " # 'apikey': 'YOUR API KEY',\n", " \"orgs\": [\"state-library-of-victoria\"],\n", " \"tags\": [],\n", " \"queries\": [\"PROV\", \"Public Records Office\", \"Museums Victoria\"],\n", " \"groups\": [],\n", " \"base_url\": \"https://www.data.vic.gov.au/data/dataset/\",\n", " \"package_ids\": [],\n", " },\n", "]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def get_value(field):\n", " \"\"\"\n", " Sometimes values are strings and sometimes objects in strings.\n", " Get string values.\n", " \"\"\"\n", " try:\n", " s = field.replace(\"u'\", \"'\").replace(\"'\", '\"')\n", " j = json.loads(s)\n", " value = j[\"name\"]\n", " except JSONDecodeError:\n", " value = field\n", " except AttributeError:\n", " value = None\n", " return value\n", "\n", "\n", "def fix_github_links(url):\n", " \"\"\"\n", " Make sure github links point to downloadable files.\n", " \"\"\"\n", " return url.replace(\"//github.com\", \"//raw.githubusercontent.com\").replace(\n", " \"/blob\", \"\"\n", " )\n", "\n", "\n", "def check_http_status(url):\n", " \"\"\"\n", " Do a HEAD request of downloadable datasets to check if they're still there.\n", " \"\"\"\n", " response = s.head(url, allow_redirects=True)\n", " return response.status_code\n", "\n", "\n", "def get_format(resource):\n", " # First try getting file extension\n", " try:\n", " url = fix_github_links(resource[\"url\"])\n", " file_format = re.search(r\"\\.([a-zA-Z]+)$\", url).group(1).upper()\n", " # If that fails just use the supplied value (which may be dodgy)\n", " except AttributeError:\n", " file_format = resource[\"format\"]\n", " return file_format\n", "\n", "\n", "def add_key(portal):\n", " \"\"\"Add an API KEY into headers.\"\"\"\n", " if \"apikey\" in portal:\n", " headers = {\n", " \"apikey\": portal[\"apikey\"],\n", " \"Content-Type\": \"application/json\",\n", " \"Accept\": \"application/json\",\n", " }\n", " else:\n", " headers = {}\n", " return headers\n", "\n", "\n", "def get_package_resources(package_id, portal, org=None):\n", " \"\"\"\n", " Given a package id and a portal, download details of all associated datasets/\n", " \"\"\"\n", " resources = []\n", " api_url = portal[\"api_url\"]\n", " url = \"{}package_show?id={}\".format(api_url, package_id)\n", " # print(url)\n", 
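" # Fetch the package's full metadata from the CKAN package_show endpoint (add_key() supplies an API key header for portals that need one)\n",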
" response = s.get(url, headers=add_key(portal))\n", " package_data = response.json()\n", " try:\n", " title = package_data[\"result\"][\"title\"]\n", " except KeyError:\n", " # Not found\n", " pass\n", " else:\n", " if org:\n", " organisation = org\n", " else:\n", " organisation = package_data[\"result\"][\"organization\"][\"title\"]\n", " try:\n", " author = get_value(package_data[\"result\"][\"author\"])\n", " except KeyError:\n", " author = None\n", " try:\n", " date_from = package_data[\"result\"][\"temporal_coverage_from\"]\n", " except KeyError:\n", " date_from = \"\"\n", " try:\n", " date_to = package_data[\"result\"][\"temporal_coverage_to\"]\n", " except KeyError:\n", " date_to = \"\"\n", " for resource in package_data[\"result\"][\"resources\"]:\n", " dataset = {}\n", " resource_url = fix_github_links(resource[\"url\"])\n", " dataset[\"dataset_title\"] = title.strip()\n", " dataset[\"publisher\"] = organisation\n", " if author:\n", " dataset[\"author\"] = author\n", " dataset[\"dataset_issued\"] = package_data[\"result\"][\"metadata_created\"]\n", " dataset[\"dataset_modified\"] = package_data[\"result\"][\"metadata_modified\"]\n", " dataset[\"dataset_description\"] = package_data[\"result\"][\"notes\"]\n", " dataset[\"source\"] = portal[\"name\"]\n", " dataset[\"info_url\"] = portal[\"base_url\"] + package_id\n", " dataset[\"start_date\"] = date_from\n", " dataset[\"end_date\"] = date_to\n", " dataset[\"file_title\"] = resource[\"name\"].strip()\n", " dataset[\"download_url\"] = resource_url\n", " dataset[\"format\"] = get_format(resource)\n", " dataset[\"file_description\"] = resource.get(\"description\")\n", " dataset[\"file_created\"] = resource[\"created\"]\n", " try:\n", " dataset[\"file_modified\"] = resource[\"last_modified\"]\n", " except KeyError:\n", " pass\n", " dataset[\"file_size\"] = resource[\"size\"]\n", " # dataset['status'] = check_http_status(resource_url)\n", " dataset[\"licence\"] = package_data[\"result\"][\"license_title\"]\n", " resources.append(dataset)\n", " return resources\n", "\n", "\n", "def process_packages(url, portal, results_label, org=None):\n", " \"\"\"\n", " Get list of packages associated with an organisation, or returned by a search,\n", " then get details of all the files (resources) inside that package.\n", " \"\"\"\n", " tqdm.write(url)\n", " resources = []\n", " response = s.get(url, headers=add_key(portal))\n", " data = response.json()\n", " for package in data[\"result\"][results_label]:\n", " resources.extend(get_package_resources(package[\"id\"], portal, org=org))\n", " time.sleep(0.2)\n", " return resources\n", "\n", "\n", "def process_portals():\n", " \"\"\"\n", " Get all of the resources from the defined portals.\n", " \"\"\"\n", " resources = []\n", " for portal in tqdm(portals):\n", " api_url = portal[\"api_url\"]\n", " for org in portal[\"orgs\"]:\n", " # url = f'{api_url}organization_show?id={org}&include_datasets=true'\n", " url = f\"{api_url}package_search?fq=organization:{org}&rows=1000\"\n", " # resources.extend(process_packages(url, portal, 'packages'))\n", " resources.extend(process_packages(url, portal, \"results\"))\n", " for tag in portal[\"tags\"]:\n", " url = f'{api_url}package_search?q=tags:\"{tag}\"&rows=1000'\n", " resources.extend(process_packages(url, portal, \"results\", org=tag))\n", " for query in portal[\"queries\"]:\n", " url = f'{api_url}package_search?q=\"{query}\"&rows=1000'\n", " resources.extend(process_packages(url, portal, \"results\", org=query))\n", " for group in portal[\"groups\"]:\n", " url = 
f\"{api_url}group_show?id={group}&include_datasets=True\"\n", " resources.extend(process_packages(url, portal, \"packages\"))\n", " return resources" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9fec6b64e5b0411ba67b2dade6f5df4a", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/6 [00:00