{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d552f7e6",
   "metadata": {},
   "source": [
    "# Data quality assessment of the Moving Image Archive dataset\n",
    "\n",
    "Created in October-December 2022 for the National Library of Scotland's Data Foundry by [Gustavo Candela, National Librarian’s Research Fellowship in Digital Scholarship 2022-23](https://data.nls.uk/projects/the-national-librarians-research-fellowship-in-digital-scholarship-2022-23/)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "074b39fc",
   "metadata": {},
   "source": [
    "### About the Moving Image Archive Dataset\n",
    "\n",
    "This dataset represents the descriptive metadata from the Moving Image Archive catalogue, which is Scotland’s national collection of moving images.\n",
    "\n",
    "- Data format: metadata available as MARCXML and Dublin Core\n",
    "- Data source: https://data.nls.uk/data/metadata-collections/moving-image-archive/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d1787be",
   "metadata": {},
   "source": [
    "### Table of contents\n",
    "\n",
    "- [Preparation](#Preparation)\n",
    "- [Loading the dataset](#Loading-the-dataset)\n",
    "- [Using the website assess the data](#Using-the-website-assess-the-data)\n",
    "- [Let's explore the subjects](#Let's-explore-the-subjects)\n",
    "- [Let's explore the authors](#Let's-explore-the-authors)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46d8e8b7",
   "metadata": {},
   "source": [
    "### Citations\n",
    "\n",
    "- Candela, G., Sáez, M. D., Escobar, P., & Marco-Such, M. (2022). Reusing digital collections from GLAM institutions. Journal of Information Science, 48(2), 251–267. https://doi.org/10.1177/0165551520950246"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b7b1cd2",
   "metadata": {},
   "source": [
    "### Preparation\n",
    "\n",
    "Import the libraries required to query the RDF dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a54d1c34",
   "metadata": {},
   "outputs": [],
   "source": [
    "from rdflib import Graph"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e77195",
   "metadata": {},
   "source": [
    "### Loading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "62b0a9b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a Graph\n",
    "g = Graph().parse(\"../rdf/datasetEnriched.ttl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5677e0cb",
   "metadata": {},
   "source": [
    "### Using the website assess the data\n",
    "\n",
    "According to the [Moving Image Archive website](https://movingimage.nls.uk/search?subject=37), the current total number of records associated to the subject Agriculture is 410. \n",
    "\n",
    "<img src=\"../images/mia-website-agriculture.png\" width=\"50%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99e6d751",
   "metadata": {},
   "source": [
    "The command line `grep` can be used to identify text in the dataset. It provides several parameters to configure the instruction such as `-n4` that retrieves 4 lines per occurrence.\n",
    "\n",
    "For example, `grep` can be used in order to identify the number of videos related to a subject (e.g., Agriculture). However, note that the word may appear in several fields, not only in 653 MARC field. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7fb51ed",
   "metadata": {},
   "source": [
    "<img src=\"../images/grep-mia.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "623c145a",
   "metadata": {},
   "source": [
    "We can retrieve the number of occurrences using the command line wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f23b4b7",
   "metadata": {},
   "source": [
    "<img src=\"../images/grep-mia-wc.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23b06418",
   "metadata": {},
   "source": [
    "### Let's explore the subjects\n",
    "\n",
    "#### Let's check how many property dc:subject contains the text \"Agriculture\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5f9c56cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##### check Agriculture subject:\n",
      "Agriculture 379\n"
     ]
    }
   ],
   "source": [
    "print('##### check Agriculture subject:')\n",
    "    \n",
    "# Query the data in g using SPARQL\n",
    "q = \"\"\"\n",
    "    SELECT ?subject (COUNT(?s) as ?count) \n",
    "    WHERE { ?s dc:subject ?subject . FILTER  regex(str(?subject),\"Agriculture\")} \n",
    "    GROUP BY ?subject\n",
    "    ORDER BY DESC(?count)\n",
    "\"\"\"\n",
    "\n",
    "# Apply the query to the graph and iterate through results\n",
    "for r in g.query(q):\n",
    "    print(str(r[\"subject\"]) + \" \" + str(r[\"count\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63a68814",
   "metadata": {},
   "source": [
    "#### Let's check the number of items containing the subject Gaelic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3eab5a4d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##### total number of items containing a property dc:subject with the text Gaelic:\n",
      "1943\n"
     ]
    }
   ],
   "source": [
    "print('##### total number of items containing a property dc:subject with the text Gaelic:')\n",
    "    \n",
    "# Query the data in g using SPARQL\n",
    "q = \"\"\"\n",
    "    SELECT (COUNT(?subject) as ?total) \n",
    "    WHERE { ?s dc:subject ?subject . FILTER regex(?subject, \"Gaelic\")} \n",
    "\"\"\"\n",
    "\n",
    "# Apply the query to the graph and iterate through results\n",
    "for r in g.query(q):\n",
    "    print(str(r[\"total\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9327550",
   "metadata": {},
   "source": [
    "### Let's explore the authors\n",
    "\n",
    "#### How many videos are associated to each author?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6096e4e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://example.org/author/scottishscreencollection 699\n",
      "https://example.org/author/grampiantelevision 499\n",
      "https://example.org/organization/templarfilms 318\n",
      "https://example.org/author/edinburghcineandvideosocietyecvs 293\n",
      "https://example.org/author/scottishballet 171\n",
      "https://example.org/organization/filmsofscotlandcommittee 166\n",
      "https://example.org/author/scottishindependencereferendumcollection2014 157\n",
      "https://example.org/author/scottisheducationalfilmassociationsefa 148\n",
      "https://example.org/organization/campbellharperproductions 130\n",
      "https://example.org/organization/scottishamateurfilmfestivalsaff 117\n"
     ]
    }
   ],
   "source": [
    "# Query the data in g using SPARQL\n",
    "q = \"\"\"\n",
    "    SELECT ?author (COUNT(distinct ?s) as ?count) \n",
    "    WHERE { ?s schema:author ?author} \n",
    "    GROUP BY ?author\n",
    "    HAVING (count(distinct ?s) > 100)\n",
    "    ORDER BY DESC(?count)\n",
    "\"\"\"\n",
    "\n",
    "# Apply the query to the graph and iterate through results\n",
    "for r in g.query(q):\n",
    "    print(str(r[\"author\"]) + \" \" + str(r[\"count\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e37d38dc",
   "metadata": {},
   "source": [
    "#### Let's check how many resources are linked to a particular author"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fd8eb963",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "37\n"
     ]
    }
   ],
   "source": [
    "# Query the data in g using SPARQL\n",
    "q = \"\"\"\n",
    "    PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n",
    "    SELECT (count(distinct ?s) as ?total)\n",
    "    WHERE {?s schema:author <https://example.org/organization/ifascotland>} \n",
    "\"\"\"\n",
    "\n",
    "# Apply the query to the graph and iterate through results\n",
    "for r in g.query(q):\n",
    "    print(r[\"total\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5486e33e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WHY SCOTLAND, WHY EAST KILBRIDE\n",
      "AMAZING MOMENTS OF THE GREAT TRACTION ENGINES, the\n",
      "DRAM LIKE THIS, a\n",
      "COUNTY ON THE MOVE\n",
      "LIVINGSTON - A TOWN FOR THE LOTHIANS\n",
      "MACKINTOSH\n",
      "DUNA BULL, the\n",
      "LIVING IN SCOTLAND / SMITHS IN SCOTLAND, the\n",
      "KIND OF SEEING: The Colour of Scotland, a\n",
      "SONGS OF SCOTLAND\n",
      "SEA CITY - GREENOCK\n",
      "ONE DAY IN IRVINE\n",
      "IN GREAT WATERS\n",
      "LOCH LOMOND\n",
      "WATER OF LIFE, the\n",
      "WATER, WATER EVERYWHERE\n",
      "ERSKINE BRIDGE, the\n",
      "LINE TO SKYE, the\n",
      "CONSEQUENCES\n",
      "SETTLED OUT OF COURT: Children's Hearings in Scotland\n",
      "LOOK TO THE SEA: The Marine Laboratory Aberdeen, Its Work and Its People\n",
      "EDINBURGH ON PARADE\n",
      "GARDENS BY THE SEA\n",
      "COME AWAY IN\n",
      "WALKABOUT EDINBURGH\n",
      "GRAND MATCH, the\n",
      "STONE IN THE HEATHER, a\n",
      "SPIRIT OF SCOTLAND, the\n",
      "KH-4\n",
      "LINE FOR ALL SEASONS, a\n",
      "DISAPPEARING ISLAND, the\n",
      "GET THERE SAFELY\n",
      "FALLS THE SHADOW\n",
      "DIAMONDS WERE FOREVER: A Celebration of Glasgow Steam\n",
      "ACRE OF SUNDAYS, an\n",
      "CHAIRS\n",
      "SHELL SHOCK\n"
     ]
    }
   ],
   "source": [
    "# Query the data in g using SPARQL\n",
    "q = \"\"\"\n",
    "    PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n",
    "    SELECT ?name\n",
    "    WHERE {?s schema:author <https://example.org/organization/ifascotland> . ?s schema:name ?name} \n",
    "\"\"\"\n",
    "\n",
    "# Apply the query to the graph and iterate through results\n",
    "for r in g.query(q):\n",
    "    print(r[\"name\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac2eabe2",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}