{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Extraction Example\n", "\n", "This notebook extracts a dataset from a digital collection described using MARCXML files, including descriptive metadata from the [Moving Image Archive](https://data.nls.uk/data/metadata-collections/moving-image-archive/) catalogue, which is Scotland’s national collection of moving images." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up things" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the libraries we need\n", "# https://pypi.org/project/pymarc/\n", "import pymarc, re, csv\n", "import pandas as pd\n", "from pymarc import parse_xml_to_array\n", "from datapackage import Package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading original files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '\"', quoting = csv.QUOTE_MINIMAL)\n", "csv_out.writerow(['title', 'author', 'place_production', 'date', 'extents', 'credits_note', 'subjects', 'summary', 'detail', 'link'])\n", "\n", "records = parse_xml_to_array(open('Moving-Image-Archive/Moving-Image-Archive-dataset-MARC.xml'))\n", "\n", "for record in records:\n", " \n", " title = author = place_production = date = extents = credits_note = subjects = summary = publisher = link =''\n", " \n", " # title\n", " if record['245'] is not None:\n", " title = record['245']['a']\n", " if record['245']['b'] is not None:\n", " title = title + \" \" + record['245']['b']\n", " \n", " # determine author\n", " if record['100'] is not None:\n", " author = record['100']['a']\n", " elif record['110'] is not None:\n", " author = record['110']['a']\n", " elif record['700'] is not None:\n", " author = record['700']['a']\n", " elif record['710'] is not None:\n", " author = record['710']['a']\n", " \n", " # place_production\n", " if record['264'] is not None:\n", " place_production = record['264']['a']\n", " \n", " # date\n", " for f in record.get_fields('264'):\n", " dates = f.get_subfields('c')\n", " if len(dates):\n", " date = dates[0]\n", " # cleaning date last .\n", " if date.endswith('.'): date = date[:-1]\n", " \n", " \n", " # Physical Description - extent\n", " for f in record.get_fields('300'):\n", " extents = f.get_subfields('a')\n", " if len(extents):\n", " extent = extents[0]\n", " # TODO cleaning\n", " details = f.get_subfields('b')\n", " if len(details):\n", " detail = details[0]\n", " \n", " # Creation/production credits note\n", " if record['508'] is not None:\n", " credits_note = record['508']['a']\n", " \n", " # Summary\n", " if record['520'] is not None:\n", " summary = record['520']['a']\n", " \n", " # subject\n", " if record['653'] is not None:\n", " subjects = '' \n", " for f in record.get_fields('653'):\n", " subjects += f.get_subfields('a')[0] + ' -- '\n", " subjects = re.sub(' -- $', '', subjects)\n", " \n", " \n", " # link\n", " if record['856'] is not None:\n", " link = record['856']['u']\n", " \n", " \n", " csv_out.writerow([title,author,place_production,date,extents,credits_note,subjects,summary,detail,link])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Data Package\n", "[Data Package](https://specs.frictionlessdata.io/data-package/) is a simple container format for describing a coherent collection of data in a single 'package'. It provides the basis for convenient delivery, installation and management of datasets. There is a [Python library](https://github.com/frictionlessdata/datapackage-py) for working with Data Packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "package = Package()\n", "package.infer('marc_records.csv')\n", "package.descriptor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the data package\n", "The Data Package contains the data and the descriptor as a zip file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "package.save('datapackage.zip')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading CSV \n", "We can also read the CSV file to explore the metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This puts the data in a Pandas DataFrame\n", "df = pd.read_csv('marc_records.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Have a peek" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's have a look inside...\n", "# Note that both the columns and rows are truncated in this preview\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create some summary data\n", "We can use Pandas to give us a quick overview of the dataset.\n", "### What are the column headings?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many records are there?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring topics\n", "### Create a list of unique topics and sort them alphabetically" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['subjects'][2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['subjects'].str.split('--', expand=True).stack()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get unique values\n", "topics = pd.unique(df['subjects'].str.split(' -- ', expand=True).stack()).tolist()\n", "for topic in sorted(topics, key=str.lower):\n", " print(topic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How often is each topic used?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Splits the topic column and counts frequencies\n", "topic_counts = df['subjects'].str.split('--').apply(lambda x: pd.Series(x).value_counts()).sum().astype('int').sort_values(ascending=False).to_frame().reset_index(level=0)\n", "# Add column names\n", "topic_counts.columns = ['subject', 'count']\n", "# Display with horizontal bars\n", "display(topic_counts.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }