{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# The use of standard licences and rights statements in Trove image records\n", "\n", "Version [2.1 of the Trove API](https://trove.nla.gov.au/sites/default/files/2020-01/API%20V2.1%20-%20Whats%20Changed%20-%20Full%20Text%20Search%20Release.pdf) introduced a new `rights` index that you can use to limit your search results to records that include one of the licences and rights statements [listed on this page](https://trove.nla.gov.au/partners/partner-services/adding-collections-trove/enrich-your-data/licensing-and-re-use). We can also use this index to build a picture of which rights statements are currently being used, and by who. Let's give it a try...\n", "\n", "The method used here is to:\n", "\n", "* Retrieve details of Trove contributors from the API\n", "* Loop through the contributors, then loop through all the licences/rights statements, firing off a search in the `picture` zone for each combination.\n", "* Save the total results for each query with the contributor and licence details.\n", "\n", "So for every organisation that contributes records to Trove, we'll find out the number of image records that include each rights statement. \n", "\n", "Problems:\n", "\n", "* Searching by contributor saves us having to harvest **all** the images, but it has a major problem. Sometimes Trove will group multiple versions of a picture held by different organisations as a single work. Rights information is saved in the version metadata, but searches only return works. So if one organisation has assigned a rights statement to a version of the image, it will look like all the organisations whose images are grouped together with it as a work are using that rights statement. I don't think this will make a huge difference to the results, but it will be something to look out for. The only way around this is to harvest everything and expand the versions out into separate record.\n", "\n", "* The `rights` index doesn't currently seem to include information on out of copyright images, unless they've actually been marked using the 'Public Domain' statement by the institution. Common statements such as 'Out of copyright', 'No known copyright restrictions', or 'Copyright expired' return no results. So there's a lot more open images than are currently reported by the rights index.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "\n", "import pandas as pd\n", "import requests_cache\n", "from dotenv import load_dotenv\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.notebook import tqdm\n", "\n", "# Create a session that will automatically retry on server errors\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# These are all the licence/rights statements recognised by Trove\n", "# Copied from https://help.nla.gov.au/trove/becoming-partner/for-content-partners/licensing-reuse\n", "licences = [\n", " \"Free/CC Public Domain\",\n", " \"Free/CC BY\",\n", " \"Free/CC0\",\n", " \"Free/RS NKC\",\n", " \"Free/RS Noc-US\",\n", " \"Free with conditions/CC BY-ND\",\n", " \"Free with conditions/CC BY-SA\",\n", " \"Free with conditions/CC BY-NC\",\n", " \"Free with conditions/CC BY-NC-ND\",\n", " \"Free with conditions/CC BY-NC-SA\",\n", " \"Free with conditions/RS NoC-NC\",\n", " \"Free with conditions/InC-NC\",\n", " \"Free with conditions/InC-EDU\",\n", " \"Restricted/RS InC\",\n", " \"Restricted/RS InC-OW-EU\",\n", " \"Restricted/RS InC-RUU\",\n", " \"Restricted/RS CNE\",\n", " \"Restricted/RS UND\",\n", " \"Restricted/NoC-CR\",\n", " \"Restricted/NoC-OKLR\",\n", "]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")\n", "\n", "HEADERS = {\"X-API-KEY\": API_KEY}" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def save_summary(contributors, record, parent=None):\n", " \"\"\"\n", " Extract basic data from contributor record, and traverse any child records.\n", " Create a full_name value by combining parent and child names.\n", " \"\"\"\n", " summary = {\"id\": record[\"id\"], \"name\": record[\"name\"]}\n", " if parent:\n", " summary[\"parent_id\"] = parent[\"id\"]\n", " summary[\"full_name\"] = f'{parent[\"full_name\"]} / {record[\"name\"]}'\n", " elif \"parent\" in record:\n", " summary[\"parent_id\"] = record[\"parent\"][\"id\"]\n", " summary[\"full_name\"] = f'{record[\"parent\"][\"name\"]} / {record[\"name\"]}'\n", " else:\n", " summary[\"full_name\"] = record[\"name\"]\n", " if \"children\" in record:\n", " for child in record[\"children\"]:\n", " save_summary(contributors, child, summary)\n", " contributors.append(summary)\n", "\n", "\n", "def get_contributors():\n", " \"\"\"\n", " Get a list of contributors form the Trove API.\n", " Flatten all the nested records.\n", " \"\"\"\n", " contributors = []\n", " contrib_params = {\"key\": API_KEY, \"encoding\": \"json\", \"reclevel\": \"full\"}\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/contributor/\",\n", " params=contrib_params,\n", " headers=HEADERS,\n", " timeout=60,\n", " )\n", " data = response.json()\n", " for record in data[\"contributor\"]:\n", " save_summary(contributors, record)\n", " return contributors" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def contributor_has_results(contrib, params, additional_query):\n", " \"\"\"\n", " Check to see is the query return any results for this contributor.\n", " \"\"\"\n", " query = f'nuc:\"{contrib[\"id\"]}\"'\n", " # Add any extra queries\n", " if additional_query:\n", " query += f\" {additional_query}\"\n", " params[\"q\"] = query\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\",\n", " params=params,\n", " headers=HEADERS,\n", " timeout=60,\n", " )\n", "\n", " data = response.json()\n", " total = int(data[\"category\"][0][\"records\"][\"total\"])\n", " if total > 0:\n", " return True\n", "\n", "\n", "def licence_counts_by_institution(additional_query=None):\n", " \"\"\"\n", " Loop through contributors and licences to harvest data about the number of times each licence is used.\n", " \"\"\"\n", " contributors = get_contributors()\n", " licence_counts = []\n", " params = {\"encoding\": \"json\", \"category\": \"image\", \"n\": 0}\n", "\n", " for contrib in tqdm(contributors):\n", " # If there are no results for this contributor then there's no point checking for licences\n", " # This should save a bit of time\n", " if contributor_has_results(contrib, params, additional_query):\n", " contrib_row = contrib.copy()\n", " # Only search for nuc ids that start with a letter\n", " if contrib[\"id\"][0].isalpha():\n", " for licence in licences:\n", " # Construct query using nuc id and licence\n", " query = f'nuc:\"{contrib[\"id\"]}\" rights:\"{licence}\"'\n", " # Add any extra queries\n", " if additional_query:\n", " query += f\" {additional_query}\"\n", " params[\"q\"] = query\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\",\n", " params=params,\n", " headers=HEADERS,\n", " timeout=60,\n", " )\n", " data = response.json()\n", " total = data[\"category\"][0][\"records\"][\"total\"]\n", " contrib_row[licence] = int(total)\n", " # print(contrib_row)\n", " licence_counts.append(contrib_row)\n", " return licence_counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "licence_counts_not_books = licence_counts_by_institution('NOT format:\"Book\"')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process the data" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.DataFrame(licence_counts_not_books)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Fill empty totals with zeros & make them all integers\n", "df[licences] = df[licences].fillna(0).astype(int)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "Free/CC Public Domain 308691\n", "Free/CC BY 171779\n", "Free/CC0 2130\n", "Free/RS NKC 5892\n", "Free/RS Noc-US 0\n", "Free with conditions/CC BY-ND 0\n", "Free with conditions/CC BY-SA 13045\n", "Free with conditions/CC BY-NC 23991\n", "Free with conditions/CC BY-NC-ND 25022\n", "Free with conditions/CC BY-NC-SA 125873\n", "Free with conditions/RS NoC-NC 0\n", "Free with conditions/InC-NC 0\n", "Free with conditions/InC-EDU 4639\n", "Restricted/RS InC 14613\n", "Restricted/RS InC-OW-EU 0\n", "Restricted/RS InC-RUU 1\n", "Restricted/RS CNE 12868\n", "Restricted/RS UND 415\n", "Restricted/NoC-CR 0\n", "Restricted/NoC-OKLR 0\n", "dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the overall distribution of rights statements\n", "df.sum(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove columns we don't need\n", "df_final = df[[\"id\", \"full_name\"] + licences]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove rows that add up to zero\n", "df_final = df_final.loc[(df_final.sum(axis=1, numeric_only=True) != 0)]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove columns that are all zero\n", "df_final = df_final.loc[:, df_final.any()]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Sort by name and save as CSV\n", "df_final.sort_values(by=[\"full_name\"]).to_csv(\"rights-on-images.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Are there any licences applied to out-of-copyright images?\n", "\n", "Some GLAM institutions apply restrictive licences to digitised versions of out-of-copyright images. Under Australian copyright law, photographs created before 1955 are out of copyright, so we can adjust our query and look to see what sorts of rights statements are attached to them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "licence_counts_out_of_copyright = licence_counts_by_institution(\n", " \"format:Photograph date:[* TO 1954]\"\n", ")" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df2 = pd.DataFrame(licence_counts_out_of_copyright)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Fill empty totals with zeros & make them all integers\n", "df2[licences] = df2[licences].fillna(0).astype(int)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "Free/CC Public Domain 30088\n", "Free/CC BY 15017\n", "Free/CC0 653\n", "Free/RS NKC 1537\n", "Free/RS Noc-US 0\n", "Free with conditions/CC BY-ND 0\n", "Free with conditions/CC BY-SA 934\n", "Free with conditions/CC BY-NC 84\n", "Free with conditions/CC BY-NC-ND 829\n", "Free with conditions/CC BY-NC-SA 1412\n", "Free with conditions/RS NoC-NC 0\n", "Free with conditions/InC-NC 0\n", "Free with conditions/InC-EDU 2\n", "Restricted/RS InC 128\n", "Restricted/RS InC-OW-EU 0\n", "Restricted/RS InC-RUU 0\n", "Restricted/RS CNE 572\n", "Restricted/RS UND 2\n", "Restricted/NoC-CR 0\n", "Restricted/NoC-OKLR 0\n", "dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the overall distribution of rights statements\n", "df2.sum(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove columns we don't need\n", "df2_final = df2[[\"id\", \"full_name\"] + licences]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove rows that add up to zero\n", "df2_final = df2_final.loc[(df2_final.sum(axis=1, numeric_only=True) != 0)]" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Remove columns that are all zero\n", "df2_final = df2_final.loc[:, df2_final.any()]" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Sort by name and save as CSV\n", "df2_final.sort_values(by=[\"full_name\"]).to_csv(\n", " \"rights-on-out-of-copyright-photos.csv\", index=False\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "jupyter": { "source_hidden": true }, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# IGNORE THIS CELL -- FOR TESTING ONLY\n", "\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", "\n", " def get_contributors_sample():\n", " \"\"\"\n", " Get a sample of contributors from the Trove API for testing.\n", " Flatten all the nested records.\n", " \"\"\"\n", " contributors = []\n", " contrib_params = {\"key\": API_KEY, \"encoding\": \"json\", \"reclevel\": \"full\"}\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/contributor/\",\n", " params=contrib_params,\n", " headers=HEADERS,\n", " timeout=60,\n", " )\n", " data = response.json()\n", " for record in data[\"contributor\"]:\n", " save_summary(contributors, record)\n", " return contributors[:10]\n", "\n", " get_contributors = get_contributors_sample\n", "\n", " licence_counts_not_books = licence_counts_by_institution('NOT format:\"Book\"')\n", "\n", " df = pd.DataFrame(licence_counts_not_books)\n", "\n", " licence_counts_out_of_copyright = licence_counts_by_institution(\n", " \"format:Photograph date:[* TO 1954]\"\n", " )\n", "\n", " df2 = pd.DataFrame(licence_counts_out_of_copyright)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "description": "This dataset includes information about the application of licences and rights statements to images by Trove contributors.", "isPartOf": "https://github.com/GLAM-Workbench/trove-images-rights-data/", "mainEntityOfPage": "https://glam-workbench.net/trove-images/trove-images-rights-data/", "name": "Licences and rights statements applied to images by Trove contributors", "result": [ { "description": "This dataset lists the number of images with each rights statement from organisations contributing to Trove.", "license": "https://creativecommons.org/publicdomain/zero/1.0/", "url": "https://github.com/GLAM-Workbench/trove-images-rights-data/raw/main/rights-on-images.csv" }, { "description": "This dataset lists the number of out-of-copyright photographs with each rights statement from organisations contributing to Trove.", "license": "https://creativecommons.org/publicdomain/zero/1.0/", "url": "https://github.com/GLAM-Workbench/trove-images-rights-data/raw/main/rights-on-out-of-copyright-photos.csv" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "This notebook uses Trove's `rights` index to build a picture of which licences and rights statements are currently being applied to images, and by who.", "mainEntityOfPage": "https://glam-workbench.net/trove-images/use-of-rights-statements/", "name": "The use of standard licences and rights statements in Trove image records", "url": "https://github.com/GLAM-Workbench/trove-images/blob/master/rights-statements-on-images.ipynb" } }, "nbformat": 4, "nbformat_minor": 4 }