{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extracting RDF graphs from TEI/XML documents using lxml.etree and RDFLib " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction\n", "\n", "This Jupyter notebook is a step-by-step guide to the extraction of RDF graphs from TEI/XML documents using lxml and RDFLib, as suggested by LIFT. \n", "LIFT is an open-source web application based entirely on Python. The aim of LIFT is to show and demonstrate how it is possible to extract RDF graphs, supported by widely adopted ontological vocabularies, from TEI/XML documents. \n", "This notebook will show you how to leverage the lxml.etree library to parse TEI/XML documents and the RDFLib library to build RDF statements using the information extracted from the TEI input file.\n", "\n", " \n", "**TEI/XML** - the standard vocabulary for textual encoding in the humanities \n", "**lxml.etree** - a Python library for XML processing \n", "**RDFLib** - a Python library for working with RDF " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Installing lxml and RDFLib\n", "\n", "Firstly, if you do not already have it, install lxml onto your computer by following the instructions provided at this link: . \n", " \n", "Do the same for RDFLib. Information on how to install the library is available at ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Building the TEI to RDF extraction script " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following blocks of code are ideally stored into a single Python file, which you can create and name something like `TEItoRDF.py`. Alternatively, remember that you can download this Jupyter notebook as a Python file by clicking on File > Download as > Python (.py). Let's go!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Importing lxml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting with an empty Python file, we begin by importing lxml.etree (a library for processing XML using Python, cf. section 1) into our script:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from lxml import etree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read from a TEI/XML file (hereafter referred to as 'input' or 'TEI document'), we use the `parse()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "tree = etree.parse('input-test.xml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure to specify the correct path. In this case, the file `input-test.xml` is stored in the current folder. For a basic introduction to paths see ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to retrieve the root element of the TEI document (i.e. `input-test.xml`), we use the `getroot()` method and store the result in the 'root' variable:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "root = tree.getroot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also assign the values of the TEI attributes `@xml:base` and `@xml:id`, which are attached to the root element of the TEI document, to the variables 'base_uri' and 'edition_id' respectively. These will come in handy when generating entity URIs. 
\n", "To retrieve the attributes we use the `get()` method. Note that we replaced the prefix 'xml' with the actual namespace URI wrapped in curly braces, which is the canonical way of working with attributes belonging to the xml namespace in lxml:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "base_uri = root.get('{http://www.w3.org/XML/1998/namespace}base')\n", "edition_id = root.get('{http://www.w3.org/XML/1998/namespace}id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then bind the TEI namespace to the prefix 'tei' (we will use this later to refer to TEI elements) as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tei = {'tei': 'http://www.tei-c.org/ns/1.0'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Importing RDFLib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Firstly, we import the Graph, Literal, BNode, Namespace and URIRef classes from RDFLib as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from rdflib import Graph, Literal, BNode, Namespace, URIRef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Secondly, we declare the namespaces of the ontological vocabularies that are going to provide the semantics of the resulting RDF graph.\n", "Some namespaces are available by direct import from RDFLib, so we can simply type:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from rdflib.namespace import RDF, RDFS, XSD, DCTERMS, OWL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any other namespace must be declared in the following way (these are the ontologies used in LIFT):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "agrelon = Namespace(\"https://d-nb.info/standards/elementset/agrelon#\")\n", "crm = 
Namespace(\"http://www.cidoc-crm.org/cidoc-crm/\")\n", "frbroo = Namespace(\"http://iflastandards.info/ns/fr/frbr/frbroo/\")\n", "pro = Namespace(\"http://purl.org/spar/pro/\")\n", "proles = Namespace(\"http://www.essepuntato.it/2013/10/politicalroles/\")\n", "prov = Namespace(\"http://www.w3.org/ns/prov#\")\n", "ti = Namespace(\"http://www.essepuntato.it/2012/04/tvc/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An RDFLib graph is a set of RDF triples. We declare our output graph and name it 'g':" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "g = Graph()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the function `bind()`, we bind each of our namespaces to a prefix:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "g.bind(\"agrelon\", agrelon)\n", "g.bind(\"crm\", crm)\n", "g.bind(\"frbroo\", frbroo)\n", "g.bind(\"dcterms\", DCTERMS)\n", "g.bind(\"owl\", OWL)\n", "g.bind(\"pro\", pro)\n", "g.bind(\"proles\", proles)\n", "g.bind(\"prov\", prov)\n", "g.bind(\"ti\", ti)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Personal entities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to iterate through all `<person>` elements in the TEI document, we use the lxml `findall()` method, which takes as an argument an expression written in ElementPath, a simple XPath-like language, and returns a list of matching elements (line 1). Then, for each person in the TEI document, we:\n", "1. Extract the person's `@xml:id` (line 2).\n", "2. Build a unique URI for the person by concatenating the 'base_uri' from above with the person's `@xml:id`. In order to make clear what kind of resource the URI represents, we also add the directory `/person/` before the actual person's `@xml:id` (line 3).\n", "3. 
We add our first triple to the RDF graph: the subject of the RDF statement is the person, the predicate is `rdf:type`, the object is `crm:E21_Person`. This triple states that the person belongs to the class `crm:E21_Person` (line 4). \n", " \n", "We suggest that you keep the TEI document `input-test.xml` within sight, to better grasp how the extraction script works." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    g.add( (person_uri, RDF.type, crm.E21_Person))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, run the following `print()` calls to print out the set of triples just generated (this is just a test, which you can run at any time during this tutorial; at the end, we will write the RDF graph to a file). RDFLib allows us to choose among different serialization formats, such as xml, n3, and nt:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RDF/XML serialization:\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", "Notation3 serialization:\n", "\n", "@prefix agrelon: .\n", "@prefix crm: .\n", "@prefix dcterms: .\n", "@prefix frbroo: .\n", "@prefix owl: .\n", "@prefix pro: .\n", "@prefix proles: .\n", "@prefix prov: .\n", "@prefix rdf: .\n", "@prefix rdfs: .\n", "@prefix ti: .\n", "@prefix xml: .\n", "@prefix xsd: .\n", "\n", " a crm:E21_Person .\n", "\n", " a crm:E21_Person .\n", "\n", " a crm:E21_Person .\n", "\n", " a crm:E21_Person .\n", "\n", " a crm:E21_Person .\n", "\n", "\n", "N-triples serialization:\n", "\n", " .\n", " .\n", " .\n", " .\n", " .\n", "\n", "\n" ] } ], "source": [ "print('RDF/XML serialization:\\n')\n", 
"print(g.serialize(format='xml'))\n", "\n", "print('Notation3 serialization:\\n')\n", "print(g.serialize(format='n3'))\n", "\n", "print('N-triples serialization:\\n')\n", "print(g.serialize(format='nt'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moving on, we look for a `@sameAs` attribute provided on the `<person>` element. We expect this attribute to contain one or more URIs pointing to authority records such as VIAF, or to resources about the same person such as DBpedia:\n", "1. Using the `get()` method, we look for a `@sameAs` attribute and store its value in the variable 'same_as' (line 4).\n", "2. If the attribute exists (line 5), we split its contents by whitespace (line 6) and then loop through the resulting list of URIs (lines 7-11). At each iteration we record the current URI in the variable 'same_as_uri' (line 9).\n", "3. Finally, we add a triple to the RDF graph at each iteration (line 10): the subject of the RDF statement is the person, the predicate is `owl:sameAs`, the object is the URI retrieved from within the `@sameAs` attribute. For example, if a `@sameAs` attribute contains two URIs, two distinct RDF triples are added to the graph. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    same_as = person.get('sameAs')\n", "    if same_as is not None:\n", "        same_as = same_as.split()\n", "        i = 0\n", "        while i < len(same_as):\n", "            same_as_uri = URIRef(same_as[i])\n", "            g.add( (person_uri, OWL.sameAs, same_as_uri))\n", "            i += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to provide each person entity with a human-readable label:\n", "1. We iterate again through all persons, looking for `<persName>` elements (lines 1-4).\n", "2. 
If we find a personal name (line 5), we store the content of the element in a 'label' variable and check whether an `@xml:lang` attribute is also present (lines 6-7).\n", "3. If an `@xml:lang` is found, the script adds an RDF triple. The subject of the triple is the person, the predicate is `rdfs:label`, and the object is a literal value (i.e. a string). A language tag is also attached to the literal (e.g. `xml:lang='en'` for English) (lines 8-9).\n", "4. If no `@xml:lang` is found, the script creates an RDF triple without declaring any specific language (lines 10-11)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    persname = person.find('./tei:persName', tei)\n", "    if persname is not None:\n", "        label = persname.text\n", "        if persname.get('{http://www.w3.org/XML/1998/namespace}lang') is not None:\n", "            label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')\n", "            g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))\n", "        else:\n", "            g.add( (person_uri, RDFS.label, Literal(label)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In TEI, groups of related `<person>` elements (e.g. persons of the same type) are usually nested within a common `<listPerson>` element. The following script retrieves any `@type` or `@corresp` attributes on `<listPerson>`. These should contain a natural language description of the person's type or an authority record URI respectively:\n", "1. We look for a `<listPerson>` parent element (line 4).\n", "2. We retrieve the attributes `@type` and/or `@corresp` (lines 5-6).\n", "3. 
If a `@type` attribute was found, we add an RDF triple formed by the person's URI, the property `dcterms:description` and a literal value containing a natural language description of the person's type (lines 7-8).\n", "4. If a `@corresp` attribute was found, we add an RDF triple formed by the person's URI, the property `dcterms:subject` and a URI (ideally) of an authority record (lines 9-10)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    listperson = person.getparent()\n", "    perstype = listperson.get('type')\n", "    perscorr = listperson.get('corresp')\n", "    if perstype is not None:\n", "        g.add( (person_uri, DCTERMS.description, Literal(perstype)))\n", "    if perscorr is not None and perscorr.startswith('http'):\n", "        g.add( (person_uri, DCTERMS.subject, URIRef(perscorr)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may also be interested in extracting all references to a particular person in the text. The following script does precisely this:\n", "1. It looks for any reference to the person, i.e. any `<persName>` element in the text whose `@ref` attribute corresponds to the `@xml:id` of the person (lines 4-5).\n", "2. It retrieves the parent element of the `<persName>` and creates a unique URI for it (lines 6-8).\n", "3. It adds an RDF statement which has the person as a subject, followed by the property `dcterms:isReferencedBy`, and the parent element's URI (line 9).\n", "4. It adds two RDF statements describing the parent element's entity, which is a `frbroo:F23_Expression_Fragment` (cf. ) that is part of the TEI file (lines 10-11). 
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    ref = './tei:text//tei:persName[@ref=\"#' + person_id + '\"]'\n", "    for referenced_person in root.findall(ref, tei):\n", "        parent = referenced_person.getparent()\n", "        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')\n", "        parent_uri = URIRef(base_uri + '/text/' + parent_id)\n", "        g.add( (person_uri, DCTERMS.isReferencedBy, parent_uri))\n", "        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))\n", "        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our person's description is now complete. \n", "Note that we could also write all of the above code by dividing it into smaller functions (i.e. `def function_name()`) as shown in the following block of code, then call the functions all together at the end. 
In this way, we save some lines of code and make our script a little easier to maintain. Note that the functions read `person_uri` from the global scope, so they must be called inside the loop that sets it:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def subject(person):\n", "    g.add( (person_uri, RDF.type, crm.E21_Person))\n", "\n", "def sameas(person):\n", "    same_as = person.get('sameAs')\n", "    if same_as is not None:\n", "        same_as = same_as.split()\n", "        i = 0\n", "        while i < len(same_as):\n", "            same_as_uri = URIRef(same_as[i])\n", "            g.add( (person_uri, OWL.sameAs, same_as_uri))\n", "            i += 1\n", "\n", "def persname(person):\n", "    persname = person.find('./tei:persName', tei)\n", "    if persname is not None:\n", "        label = persname.text\n", "        label_lang = persname.get('{http://www.w3.org/XML/1998/namespace}lang')\n", "        if label_lang is not None:\n", "            g.add( (person_uri, RDFS.label, Literal(label, lang=label_lang)))\n", "        else:\n", "            g.add( (person_uri, RDFS.label, Literal(label)))\n", "\n", "def perstype(person):\n", "    listperson = person.getparent()\n", "    perstype = listperson.get('type')\n", "    perscorr = listperson.get('corresp')\n", "    if perstype is not None:\n", "        g.add( (person_uri, DCTERMS.description, Literal(perstype)))\n", "    if perscorr is not None and perscorr.startswith('http'):\n", "        g.add( (person_uri, DCTERMS.subject, URIRef(perscorr)))\n", "\n", "def referenced_person(person_id):\n", "    ref = './tei:text//tei:persName[@ref=\"#' + person_id + '\"]'\n", "    for referenced_person in root.findall(ref, tei):\n", "        parent = referenced_person.getparent()\n", "        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')\n", "        parent_uri = URIRef(base_uri + '/text/' + parent_id)\n", "        g.add( (person_uri, DCTERMS.isReferencedBy, parent_uri))\n", "        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))\n", "        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))\n", "\n", "# Calling all functions\n", "for person in root.findall('.//tei:person', tei):\n", "    person_id = 
person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    person_ref = '#' + person_id\n", "    subject(person)\n", "    sameas(person)\n", "    persname(person)\n", "    referenced_person(person_id)\n", "    perstype(person)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the rest of this Jupyter notebook, we will adopt this style: the script will be divided into functions, which will be called afterwards." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 3.4 Persons participating in events" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following group of functions extracts information about people participating in events. Participation in an event revolves around the conceptual class `pro:RoleInTime`, which represents a \"particular situation that describe a role an agent may have, that can be restricted to a particular time interval\" (). Such a class is directly related to the person, his/her role, a time, and an event. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by checking if a person participates (i.e. has a role) in an event:\n", "1. For each `<event>` element within the `<person>` element, we build a unique URI representing the participation of the individual in the event (line 2).\n", "2. We add a new RDF triple: the subject is the person, the predicate is `pro:holdsRoleInTime`, the object is the participation of the individual in the event (i.e. the `pro:RoleInTime`). You can find a visual diagram of the PRO ontology [here](https://sparontologies.github.io/pro/current/pro.html#introduction) (line 3). 
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def partic_event(person):\n", "    partic_event_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)  # must match 'rit_uri', which the other functions use\n", "    g.add( (person_uri, pro.holdsRoleInTime, partic_event_uri))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next block aims to extract information about the role held by the person in a specific event. \n", "1. Firstly, for each `<event>` element found in `<person>`, we add to our graph an RDF triple which assigns the participation of a person in a specific event to the class `pro:RoleInTime` (line 2). \n", "2. We then look for a `<persName>` element within the `<desc>` element of the event (line 3). \n", "3. If we find a `<persName>` with an attribute `@ref` pointing to the person record within which the event is nested (cf. the XPath expression `//person[@xml:id=\"Socr\"]//persName[@ref=\"#Socr\"]` in `input.xml`), as well as an attribute `@role` on this `<persName>` (line 4), we build a unique URI for that role (line 5).\n", "4. Then, we add three RDF triples (lines 6-8). The first triple relates the role-in-time played by the person at the event to the specific role; the second triple assigns the role to the class `pro:Role`; the third triple associates a human-readable label with the role entity. \n", "5. If a `@corresp` attribute is also present (line 9) (this should contain a link to an authority record for the role), the script adds an extra RDF triple to associate the authority record URI with the role entity via an `owl:sameAs` property (lines 10-11).\n", "6. Else, if the person is not mentioned within `<desc>`, we add the same triples as above with the only difference that the role is set to 'participant' (lines 12-17). The participation of the person in the event is taken for granted, as the `<event>` element is nested within the `<person>` element. 
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def role_in_event(person):\n", "    g.add( (rit_uri, RDF.type, pro.RoleInTime))\n", "    pers_in_event = event.find('./tei:desc/tei:persName', tei)\n", "    if pers_in_event is not None and pers_in_event.get('ref') == person_ref and pers_in_event.get('role') is not None:\n", "        role_uri = URIRef(base_uri + '/role/' + pers_in_event.get('role'))\n", "        g.add( (rit_uri, pro.withRole, role_uri))\n", "        g.add( (role_uri, RDF.type, pro.Role))\n", "        g.add( (role_uri, RDFS.label, Literal(pers_in_event.get('role'))))\n", "        if pers_in_event.get('corresp') is not None:\n", "            corresp_role_uri = URIRef(pers_in_event.get('corresp'))\n", "            g.add( (role_uri, OWL.sameAs, corresp_role_uri))\n", "    else:\n", "        role_uri = URIRef(base_uri + '/role/participant')\n", "        g.add( (rit_uri, pro.withRole, role_uri))\n", "        g.add( (role_uri, RDF.type, pro.Role))\n", "        g.add( (role_uri, OWL.sameAs, URIRef('http://wordnet-rdf.princeton.edu/id/10421528-n')))\n", "        g.add( (role_uri, RDFS.label, Literal('participant')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following block aims at extracting information about the time of an event: \n", "1. We relate the participation of a person in a specific event (i.e. a `pro:RoleInTime`) to a specific time (lines 2-3), and assign the time entity to the class `TimeInterval` (line 4). \n", "2. We look for `@when`, `@from`, or `@to` attributes to determine the time interval and add RDF triples on the basis of the values found (lines 5-11)."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def event_time():\n", "    event_time_uri = URIRef(base_uri + '/' + event_id + '-time')\n", "    g.add( (rit_uri, ti.atTime, event_time_uri))\n", "    g.add( (event_time_uri, RDF.type, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#TimeInterval')))\n", "    if event.get('when') is not None:\n", "        g.add( (event_time_uri, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#hasIntervalStartDate'), Literal(event.get('when'), datatype=XSD.date)))\n", "        g.add( (event_time_uri, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#hasIntervalEndDate'), Literal(event.get('when'), datatype=XSD.date)))\n", "    if event.get('from') is not None:\n", "        g.add( (event_time_uri, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#hasIntervalStartDate'), Literal(event.get('from'), datatype=XSD.date)))\n", "    if event.get('to') is not None:\n", "        g.add( (event_time_uri, URIRef('http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#hasIntervalEndDate'), Literal(event.get('to'), datatype=XSD.date)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to describe the event itself:\n", "1. First, we relate the event to the 'role-in-time' played by the person through the property `pro:relatesToEntity` (line 2).\n", "2. Then, we assign the event entity to the class `crm:E5_Event` (line 3).\n", "3. If available, we associate the event entity with a human-readable label describing it (lines 4-6).\n", "4. If available, we associate the event entity with a type (lines 7-8) and/or an HTTP URI from an authority record (lines 9-10)."
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def event_desc():\n", "    g.add( (rit_uri, pro.relatesToEntity, event_uri))\n", "    g.add( (event_uri, RDF.type, crm.E5_Event))\n", "    if event.find('./tei:label', tei) is not None:\n", "        label = event.find('./tei:label', tei).text\n", "        g.add( (event_uri, RDFS.label, Literal(label)))\n", "    if event.get('type') is not None:\n", "        g.add( (event_uri, DCTERMS.description, Literal(event.get('type'))))\n", "    if event.get('corresp') is not None and event.get('corresp').startswith('http'):\n", "        g.add( (event_uri, DCTERMS.subject, URIRef(event.get('corresp'))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to extract information about the place where the event took place, we:\n", "1. Look for any `<placeName>` elements in the `<desc>` element of the event (line 2).\n", "2. If there is more than one, we search for the `<placeName>` element that carries the attribute `@type=\"place_of_event\"` and add a triple relating the 'role-in-time' to that specific place. The place URI is built by concatenating the project base URI, the directory `/place/` and the unique ID of the place (lines 3-5).\n", "3. If the `<desc>` element contains only one reference to a place, the script simply uses that as the place record for the event (lines 6-7)."
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def event_place():\n", "    places = event.findall('./tei:desc/tei:placeName', tei)\n", "    if len(places) > 1:\n", "        for place in [p for p in places if p.get('type') == 'place_of_event']:\n", "            g.add( (rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + place.get('ref').replace(\"#\", \"\"))))\n", "    elif len(places) == 1:\n", "        g.add( (rit_uri, proles.relatesToPlace, URIRef(base_uri + '/place/' + places[0].get('ref').replace(\"#\", \"\"))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a literary source for the event is cited within the `<event>` element itself, we run the following set of instructions:\n", "1. We look for any `<bibl>` element within `<event>` (line 2).\n", "2. If a `<bibl>` is found (line 3), we build a URI for it after having retrieved its unique ID (lines 4-5), then add a new RDF triple to our graph: the subject is the event entity, which is linked to the source via the property `prov:hadPrimarySource` (line 6).\n", "3. We also add an RDF triple assigning the source to the class `prov:PrimarySource` (line 7).\n", "4. The `<bibl>` element may contain the elements `<author>`, `<title>`, and `<date>`. An RDF triple is generated for each of these metadata elements, if present (lines 8-16).\n", "5. Finally, we search for a `@sameAs` attribute on `<bibl>` to relate the source to a related resource or to an authority record such as [Worldcat](https://www.worldcat.org) (lines 17-20). 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def event_source():\n", "    source = event.find('./tei:bibl', tei)\n", "    if source is not None:\n", "        source_id = source.get('{http://www.w3.org/XML/1998/namespace}id')\n", "        source_uri = URIRef(base_uri + '/source/' + source_id)\n", "        g.add( (event_uri, prov.hadPrimarySource, source_uri))\n", "        g.add( (source_uri, RDF.type, prov.PrimarySource))\n", "        if source.find('./tei:author', tei) is not None and source.find('./tei:author', tei).get('ref') is not None:\n", "            author_ref = source.find('./tei:author', tei).get('ref')\n", "            author_id = author_ref.split('#')\n", "            g.add( (source_uri, DCTERMS.creator, URIRef(base_uri + '/person/' + author_id[1])))\n", "        if source.find('./tei:title', tei) is not None:\n", "            g.add( (source_uri, DCTERMS.title, Literal(source.find('./tei:title', tei).text)))\n", "        evdate = source.find('./tei:date', tei)\n", "        if evdate is not None and evdate.get('when') is not None:\n", "            g.add( (source_uri, DCTERMS.date, Literal(evdate.get('when'), datatype=XSD.date)))\n", "        if source.get('sameAs') is not None:\n", "            sameAs = source.get('sameAs')\n", "            if sameAs.startswith('http'):\n", "                g.add( (source_uri, OWL.sameAs, URIRef(source.get('sameAs'))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we call the functions just created. If you wish, you can print out the resulting graph as done in section 3.3 of this notebook by typing `print(g.serialize(format=\"n3\"))`."
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    person_ref = '#' + person_id\n", "    for event in person.findall('./tei:event', tei):\n", "        event_id = event.get('{http://www.w3.org/XML/1998/namespace}id')\n", "        event_uri = URIRef(base_uri + '/event/' + event_id)\n", "        rit_uri = URIRef(base_uri + '/rit/' + person_id + '-at-' + event_id)\n", "        partic_event(person)\n", "        role_in_event(person)\n", "        event_time()\n", "        event_desc()\n", "        event_place()\n", "        event_source()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Relations between persons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The aim of the following script is to extract information about the relationships in which a person participates. In TEI, relationships are normally encoded using the element `<relation>`, grouped in a `<listRelation>` element nested within `<listPerson>`. There are two main types of relationships: active/passive (unilateral relationship, e.g. Person A (active) is mother of Person B (passive)) and mutual (reciprocal relationship, e.g. Person A is a colleague of Person B and vice versa). \n", "1. For each person, the script iterates through all `<relation>` elements (lines 1-2).\n", "2. If an `@active` attribute containing a reference to the person is found on `<relation>` (line 3), the script iterates through all values of the `@passive` attribute, adding an RDF triple for each of them (lines 4-8). The `@name` attribute on `<relation>` should provide a term from an ontology such as AgRelOn (<https://d-nb.info/standards/elementset/agrelon>) (line 7).\n", "3. The same is done for mutual relationships, skipping the reference to the person itself (lines 9-15). 
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def relation(person):\n", "    for relation in root.findall('.//tei:listRelation/tei:relation', tei):\n", "        if relation.get('active') is not None and relation.get('active') == person_ref:\n", "            passive = (relation.get('passive') or '').replace(\"#\", \"\").split()\n", "            i = 0\n", "            while i < len(passive):\n", "                g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + passive[i])))\n", "                i += 1\n", "        elif relation.get('mutual') is not None:\n", "            if person_ref in relation.get('mutual').split():\n", "                mutual = [m.replace(\"#\", \"\") for m in relation.get('mutual').split() if m != person_ref]\n", "                i = 0\n", "                while i < len(mutual):\n", "                    g.add( (person_uri, agrelon[relation.get('name')], URIRef(base_uri + '/person/' + mutual[i])))\n", "                    i += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now call the `relation(person)` function:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "for person in root.findall('.//tei:person', tei):\n", "    person_id = person.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    person_uri = URIRef(base_uri + '/person/' + person_id)\n", "    person_ref = '#' + person_id\n", "    relation(person)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 Places" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is about describing all places mentioned in the TEI file:\n", "1. For each `<place>` element found, we add an RDF triple to our graph assigning the place entity to the class `crm:E53_Place`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def subject(place):\n", "    g.add( (place_uri, RDF.type, crm.E53_Place))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moving on, we look for a `@sameAs` attribute provided on the `<place>` element. 
We expect this attribute to contain one or more URIs pointing to authority records such as VIAF, or to resources about the same place such as DBpedia:\n", "1. Using the `get()` function, we look for a `@sameAs` attribute and split its contents by whitespace; a missing attribute yields an empty list (line 2).\n", "2. We loop through the list once for each URI stored in the `@sameAs` attribute (lines 3-7), recording the current URI in the variable 'same_as_uri' (line 5).\n", "3. Finally, we add a triple to the RDF graph at each loop (line 6): the subject of the RDF statement is the place entity, the predicate is `owl:sameAs`, the object is the URI retrieved from within the `@sameAs` attribute." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def place_sameas(place):\n", "    same_as = (place.get('sameAs') or '').split()\n", "    i = 0\n", "    while i < len(same_as):\n", "        same_as_uri = URIRef(same_as[i])\n", "        g.add( (place_uri, OWL.sameAs, same_as_uri))\n", "        i += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to provide each place entity with a human-readable label:\n", "1. We look for the `<placeName>` element within `<place>` (line 2).\n", "2. We store the text content of this element in a 'label' variable and check whether an `@xml:lang` attribute is also present (lines 3-4).\n", "3. If an `@xml:lang` is found, the script adds an RDF triple. The subject of this triple is the place entity, the predicate is `rdfs:label`, and the object is a literal value to which the language declaration is attached (lines 5-6).\n", "4. If no `@xml:lang` is found, the script creates an RDF triple without declaring any specific language (lines 7-8)." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def placename(place):\n", "    placename = place.find('./tei:placeName', tei)\n", "    label = placename.text\n", "    label_lang = placename.get('{http://www.w3.org/XML/1998/namespace}lang')\n", "    if label_lang is not None:\n", "        g.add( (place_uri, RDFS.label, Literal(label, lang=label_lang)))\n", "    else:\n", "        g.add( (place_uri, RDFS.label, Literal(label)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may also be interested in extracting all references to a particular place in the text. The following script does precisely this:\n", "1. It looks for any reference to the place, i.e. any `<placeName>` element in the text whose `@ref` attribute corresponds to the `@xml:id` of the place (lines 2-3).\n", "2. It retrieves the parent element of the `<placeName>` and creates a unique URI for it (lines 4-6).\n", "3. It adds an RDF statement which has the place as its subject, followed by the property `dcterms:isReferencedBy` and the parent element's URI (line 7).\n", "4. It adds two RDF statements describing the parent element's entity, which is a [`frbroo:F23_Expression_Fragment`](http://iflastandards.info/ns/fr/frbr/frbroo/F23) that is part of the TEI file (lines 8-9)." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def referenced_place(place_id):\n", "    ref = './/tei:placeName[@ref=\"#' + place_id + '\"]'\n", "    for referenced_place in root.findall(ref, tei):\n", "        parent = referenced_place.getparent()\n", "        parent_id = parent.get('{http://www.w3.org/XML/1998/namespace}id')\n", "        parent_uri = URIRef(base_uri + '/text/' + parent_id)\n", "        g.add( (place_uri, DCTERMS.isReferencedBy, parent_uri))\n", "        g.add( (parent_uri, RDF.type, frbroo.F23_Expression_Fragment))\n", "        g.add( (parent_uri, frbroo.R15i_is_fragment_of, URIRef(base_uri + '/' + edition_id)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we call the functions while iterating through each `<place>` element:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "for place in root.findall('.//tei:place', tei):\n", "    place_id = place.get('{http://www.w3.org/XML/1998/namespace}id')\n", "    place_uri = URIRef(base_uri + '/place/' + place_id)\n", "    place_ref = '#' + place_id\n", "    subject(place)\n", "    place_sameas(place)\n", "    placename(place)\n", "    referenced_place(place_id)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 3.7 Printing out to a file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following instructions write the RDF graph to external files. 
Besides the serialization format, you can specify a destination as follows:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# RDF/XML output\n", "g.serialize(destination=\"output.xml\", format='xml')\n", "\n", "# Notation3 output\n", "g.serialize(destination=\"output.n3\", format='n3')\n", "\n", "# N-triples output\n", "g.serialize(destination=\"output.nt\", format='nt')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.16" } }, "nbformat": 4, "nbformat_minor": 2 }