{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset Extraction Example\n",
    "\n",
    "This notebook extracts a dataset from a digital collection described using MARCXML files, including descriptive metadata from the [Moving Image Archive](https://data.nls.uk/data/metadata-collections/moving-image-archive/) catalogue, which is Scotland’s national collection of moving images."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting up things"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the libraries we need\n",
    "# https://pypi.org/project/pymarc/\n",
    "import pymarc, re, csv\n",
    "import pandas as pd\n",
    "from pymarc import parse_xml_to_array\n",
    "from datapackage import Package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading original files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '\"', quoting = csv.QUOTE_MINIMAL)\n",
    "csv_out.writerow(['title', 'author', 'place_production', 'date', 'extents', 'credits_note', 'subjects', 'summary', 'detail', 'link'])\n",
    "\n",
    "records = parse_xml_to_array(open('Moving-Image-Archive/Moving-Image-Archive-dataset-MARC.xml'))\n",
    "\n",
    "for record in records:\n",
    "    \n",
    "    title = author = place_production = date = extents = credits_note = subjects = summary = publisher = link =''\n",
    "    \n",
    "    # title\n",
    "    if record['245'] is not None:\n",
    "      title = record['245']['a']\n",
    "      if record['245']['b'] is not None:\n",
    "        title = title + \" \" + record['245']['b']\n",
    "    \n",
    "    # determine author\n",
    "    if record['100'] is not None:\n",
    "      author = record['100']['a']\n",
    "    elif record['110'] is not None:\n",
    "      author = record['110']['a']\n",
    "    elif record['700'] is not None:\n",
    "      author = record['700']['a']\n",
    "    elif record['710'] is not None:\n",
    "      author = record['710']['a']\n",
    "    \n",
    "    # place_production\n",
    "    if record['264'] is not None:\n",
    "      place_production = record['264']['a']\n",
    "    \n",
    "    # date\n",
    "    for f in record.get_fields('264'):\n",
    "        dates = f.get_subfields('c')\n",
    "        if len(dates):\n",
    "            date = dates[0]\n",
    "            # cleaning date last .\n",
    "            if date.endswith('.'): date = date[:-1]\n",
    "    \n",
    "    \n",
    "    # Physical Description - extent\n",
    "    for f in record.get_fields('300'):\n",
    "        extents = f.get_subfields('a')\n",
    "        if len(extents):\n",
    "            extent = extents[0]\n",
    "            # TODO cleaning\n",
    "        details = f.get_subfields('b')\n",
    "        if len(details):\n",
    "            detail = details[0]\n",
    "            \n",
    "    # Creation/production credits note\n",
    "    if record['508'] is not None:\n",
    "      credits_note = record['508']['a']\n",
    "    \n",
    "    # Summary\n",
    "    if record['520'] is not None:\n",
    "      summary = record['520']['a']\n",
    "    \n",
    "    # subject\n",
    "    if record['653'] is not None:\n",
    "        subjects = '' \n",
    "        for f in record.get_fields('653'):\n",
    "            subjects += f.get_subfields('a')[0] + ' -- '\n",
    "        subjects = re.sub(' -- $', '', subjects)\n",
    "    \n",
    "    \n",
    "    # link\n",
    "    if record['856'] is not None:\n",
    "      link = record['856']['u']\n",
    "    \n",
    "    \n",
    "    csv_out.writerow([title,author,place_production,date,extents,credits_note,subjects,summary,detail,link])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Data Package\n",
    "[Data Package](https://specs.frictionlessdata.io/data-package/) is a simple container format for describing a coherent collection of data in a single 'package'. It provides the basis for convenient delivery, installation and management of datasets. There is a [Python library](https://github.com/frictionlessdata/datapackage-py) for working with Data Packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "package = Package()\n",
    "package.infer('marc_records.csv')\n",
    "package.descriptor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the data package\n",
    "The Data Package contains the data and the descriptor as a zip file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "package.save('datapackage.zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading CSV \n",
    "We can also read the CSV file to explore the metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This puts the data in a Pandas DataFrame\n",
    "df = pd.read_csv('marc_records.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Have a peek"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's have a look inside...\n",
    "# Note that both the columns and rows are truncated in this preview\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create some summary data\n",
    "We can use Pandas to give us a quick overview of the dataset.\n",
    "### What are the column headings?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How many records are there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring topics\n",
    "### Create a list of unique topics and sort them alphabetically"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['subjects'][2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['subjects'].str.split('--', expand=True).stack()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get unique values\n",
    "topics = pd.unique(df['subjects'].str.split(' -- ', expand=True).stack()).tolist()\n",
    "for topic in sorted(topics, key=str.lower):\n",
    "    print(topic)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How often is each topic used?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Splits the topic column and counts frequencies\n",
    "topic_counts = df['subjects'].str.split('--').apply(lambda x: pd.Series(x).value_counts()).sum().astype('int').sort_values(ascending=False).to_frame().reset_index(level=0)\n",
    "# Add column names\n",
    "topic_counts.columns = ['subject', 'count']\n",
    "# Display with horizontal bars\n",
    "display(topic_counts.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}