{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Peel's Prairie Provinces Historical Postcards: XML metadata extractor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code extracts the metadata from the Peel's Prarie Provinces Historical Postcards. The full data set (consisting of 13,472 unique records) can be downloaded from the <a href=\"https://doi.org/10.7939/DVN/10709\" target=\"_blank\">University of Alberta Libraries Dataverse</a>. Information about the Peel's Prairie Provinces Historical Postcards project (including photographs of the postcards themselves) can be found <a href=\"http://peel.library.ualberta.ca/collections.html\" target=\"_blank\">at this link</a>.\n",
    "\n",
    "In all honesty, this code takes a 'brute force' approach to extracting the data, meaning that the resulting CSV file requires some modest cleanup. This mostly involves removing quotations and square brackets, though there are a number of records (between 50-100) that are a bit messier. This is due to multiple records existing within one XML file. For these reasons, this code remains a work-in-progress that may be updated periodically. \n",
    "\n",
    "That being said, the code works. It extracts a selection of information from the XML file containing the metatdata. These information include: postcard ID, date, city, province, country, postcard title, photograph URL, subject terms, and the names and roles of the producers or photographers. The XML file do contain some other fields and these can be easily added to the code, following the conventions of the existing code. The extraction is accomplished using <a href=\"http://www.w3schools.com/xsl/xpath_intro.asp\" target=\"_blank\">XPath</a>, which is a syntax that defines parts of XML documents. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code block imports the Python modules needed for the extraction and defines a shortcut for the namespace profile. A <a href=\"http://www.w3schools.com/xml/xml_namespaces.asp\" target=\"_blank\">namespace</a> is a method of avoiding name conflicts in XML files by identifying the vocaulary used to code the document. In this case, the namespace refers to the <a href=\"http://www.loc.gov/mods/v3\" target=\"_blank\">Metadata Object Description Schema</a>, which defines the elements, attributes, and type definitions used to create the structure of these XML documents. \n",
    "\n",
    "Just a note that the light blue text followed by the # in the code blocks are comments that describe each specific line of code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv # facilitates writing to a CSV file\n",
    "import os # facilitates file directory navigation\n",
    "import lxml # provides XML parsing functionality\n",
    "from lxml import etree # allows for XML navigation via XPath\n",
    "\n",
    "ns = {'v3': 'http://www.loc.gov/mods/v3'} # the namespace is known in the code as 'ns' for brevity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code provides a path for the location of the XML files. When running this code on your own machine, the only thing you will need to update is this information. The path to the file directory will depend entirely on where you saved the files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = \"/Users/Sandy/Documents/Work/DI/Peel/postcards\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a screenshot of one of the XML files. The dark blue text are the XML elements, the orange text are the attributes, and the pink text contain the attribute values. The text in black are the text we are trying to extract."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"xmlfile.png\" alt=\"xml file\" style=\"width: 600px;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is where the work of the code begins. This piece walks through the directory, identifying the file names and adding them to a list variable called `dirs`. Then it iterates through each file in the directory, parsing the XML from each file into a list variable called `parsed`, where each item in the list contains all the XML from one complete file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dirs = os.listdir(path) # compiles a list of filenames in the directory\n",
    "parsed = [] # creates an empty list\n",
    "for filename in dirs: # iterate through each file\n",
    "    page = os.path.join(path, filename) # joins the filename to the path\n",
    "    if os.path.isfile(page): # checks that the file exists\n",
    "        parsed.append(etree.parse(page)) # parses XML in file and appends to 'parsed'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to make sure that we have identified and parsed all of the files. Here we display the number of filenames saved to the variable `dirs`, and double check with the number of files saved on our machine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a screenshot of the file directory on my machine. If you look on the bottom of the image you can see the file count matches the number below the image in the `dirs` variable. This type of error checking is essential.\n",
    "\n",
    "<img src=\"filecount.png\" alt=\"file count\" style=\"width: 400px;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 13472 XML files in the directory\n"
     ]
    }
   ],
   "source": [
    "print(\"There are\", len(dirs), \"XML files in the directory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a rather large chunk of code that uses XPath to navigate through the variable `parsed` to pick out all of the metadata that has been identified, printing it to a CSV file named 'PeelOutput'. Not all the metatdata are collected here, only the ones that I wanted to extract are included. If more metadata are required (such as an alternate URL) it can easily be added to the code, following the patterns that exist here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('PeelOutput.csv', 'w') as csvfile: # opens an empty CSV called 'PeelOutput'\n",
    "    fieldnames = ['id', 'date', 'city', 'province','country', 'title','url', \\\n",
    "                  'subject1', 'nameCorporate', 'namePersonal', 'role'] # creates the column names\n",
    "    writer = csv.DictWriter(csvfile,fieldnames=fieldnames) # initializes the CSV writer\n",
    "    writer.writeheader()\n",
    "    for card in parsed:# iterate through each file\n",
    "        for e in card.xpath('//v3:mods', namespaces=ns):\n",
    "            ident = e.xpath('//v3:recordIdentifier/text()',namespaces=ns) # extracts the ID\n",
    "            date = e.xpath('//v3:dateIssued/text()', namespaces=ns) # extracts the date\n",
    "            city = e.xpath('//v3:city/text()', namespaces=ns) # extracts the city \n",
    "            prov = e.xpath('//v3:province/text()', namespaces=ns) # extracts the province\n",
    "            country = e.xpath('//v3:country/text()', namespaces=ns) #extracts the country\n",
    "            title = e.xpath('//v3:title/text()', namespaces=ns) # extracts the title\n",
    "            nameCorporate = e.xpath('(//v3:name[@type=\"corporate\"]/*/text())[1]', namespaces=ns) # extracts a corporate name\n",
    "            namePersonal = e.xpath('(//v3:name[@type=\"personal\"]/*/text())[1]', namespaces=ns) # extracts a personal name\n",
    "            role = e.xpath('(//v3:roleTerm/text())[1]', namespaces=ns) # extracts the role, related to corp or pers name above\n",
    "            url = e.xpath('//v3:url[@usage=\"primary display\"]/text()', namespaces=ns) # extracts the URL of the postcard\n",
    "            subject1 = e.xpath('//v3:topic/text()', namespaces=ns) # extracts the subject terms\n",
    "            writer.writerow({'id': ident, 'date':date, 'city': city, \\\n",
    "                             'province':prov, 'country':country, \\\n",
    "                             'title':title,'nameCorporate':nameCorporate, \\\n",
    "                             'namePersonal':namePersonal, 'role':role, 'url':url, \\\n",
    "                             'subject1':subject1}) # writes each piece of data to each column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A file called 'PeelOutput.csv' now exists in the directory. Everytime this code is run, it will overwrite that file. The next step is importing the CSV file into your spreadsheet editor of choice and cleaning up the data. The quickest way to do this is to use the spreadsheet 'find and replace' tool. Search for all of the square brackets and quotation marks and replace them with nothing."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}