For the last five years I have taught a course on “computer science for historians” at Jean Moulin Lyon 3 University. The course comprises 20 hours spread over ten sessions and has attracted twenty to forty students each year. It addresses history students in the first semester of their master's studies: at this stage they begin collecting information from archival sources and bibliography, which they will later exploit to write their master's thesis. The aim is therefore to provide them with methods and digital tools for modeling and storing information, and then for querying, visualizing and analyzing it. This is a great challenge, because many students still take notes on paper when they analyze historical sources and are not used to working with software that is not completely self-installing. Furthermore, students receive from their tutors research subjects of all kinds, from ancient to modern history, and they often want to analyze quite complex information that cannot be stored in a simple spreadsheet. For this reason, the pedagogical challenge is also a challenge for the digital humanist: how can students be provided with a generic, flexible information system that is ready to use for their research yet sophisticated enough to store any sort of data?
In this paper I will treat some theoretical and practical aspects of the digital information system I devised to cope with this problem, and will present some issues that the use of this information system raises for both students and teacher. The manuscript of the course is publicly available on a dedicated website.
The information system combines structured data, stored in a relational database, with texts encoded according to the Text Encoding Initiative's guidelines (TEI). These data production practices, both structured data and encoded texts, must be radically simplified to meet the pedagogical need described above, and this requires working at a high level of abstraction.
The first component of the information system provided to the students is a relational database designed using a generic data model. The database is implemented in PostgreSQL, because this open-source database offers extended features for handling datatypes (notably XML) and comes with a procedural language (PL/pgSQL) that allows data processing within the SQL context without the need to learn a separate programming language.
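By way of illustration, the following minimal sketch (the table, column and function names are invented for this example and are not taken from the course material) shows how a TEI-like fragment can be stored in an XML column and queried with XPath directly in SQL, and how a small PL/pgSQL function can encapsulate such a query:

-- A document table with an XML column holding the encoded text.
CREATE TABLE source_text (
    id      serial PRIMARY KEY,
    title   text,
    content xml
);

INSERT INTO source_text (title, content) VALUES
('Sample fragment',
 '<p>In 1409 <persName ref="#pers123">Jean Gerson</persName> preached at the council.</p>');

-- Extract every tagged person name with an XPath expression evaluated by PostgreSQL.
SELECT id, unnest(xpath('//persName/text()', content)) AS person
FROM source_text;

-- A small PL/pgSQL function returning the number of tagged persons in a document.
CREATE FUNCTION count_persons(doc xml) RETURNS integer AS $$
BEGIN
    RETURN (SELECT count(*) FROM unnest(xpath('//persName', doc)));
END;
$$ LANGUAGE plpgsql;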
If the "Objects" ("Endurants" or "Persistent items") are identified in the database, where then is collected information about them? According to the
symogih.org semantic data model
participating in some perdurant(s). For exemple, a person, which is an endurant, may participate in a discussion, which is a perdurant”
In former years, "knowledge units" were also stored in the database, just as the "objects" still are today.
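To make the idea more concrete, here is a deliberately simplified sketch of such a generic relational structure; the table and column names are illustrative and do not reproduce the actual symogih.org schema:

-- "Objects" are the endurants the students identify (persons, places, institutions, ...).
CREATE TABLE object (
    id    serial PRIMARY KEY,
    type  text NOT NULL,
    label text NOT NULL
);

-- "Knowledge units" record perdurants, i.e. the events or activities documented by the sources.
CREATE TABLE knowledge_unit (
    id         serial PRIMARY KEY,
    type       text NOT NULL,
    date_begin date,
    date_end   date,
    notes      text
);

-- Participation links: which object takes part in which knowledge unit, and in what role.
CREATE TABLE participation (
    knowledge_unit_id integer REFERENCES knowledge_unit(id),
    object_id         integer REFERENCES object(id),
    role              text NOT NULL,
    PRIMARY KEY (knowledge_unit_id, object_id, role)
);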
Some pages of the symogih.org project's user manual provide the encoding specification for XML/TEI semantically annotated texts using the symogih.org ontology.
This method was presented at the TEI 2015 conference in Lyon (cf. Beretta 2015). The Special Interest Group (SIG) Ontologies of the TEI Consortium is devoted to this approach; see the SIG Ontologies wiki.
But this method raises the question of which XML editor to adopt for semantic text encoding, which means adding to the data production workflow a further software component that provides XML schema validation as well as tools for querying the encoded text. XML text encoding is more suitable, and I prefer it for training PhD students and researchers, but it demands specific supplementary instruction that is impossible to provide within the limited time of the master's course. I therefore conceived a way of semantically tagging the text in a simple text editor or word processing program, using curly brackets instead of angle brackets and replacing XML attributes with predefined codes. This method is described on the course wiki, which also provides instructions for using regular expressions to ensure proper encoding.
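The exact codes used in the course are documented on the wiki; purely as a hedged illustration, the following query shows how a regular expression can turn an invented curly-bracket annotation such as {pers:123 Jean Gerson} into a TEI-like element (here with PostgreSQL's regexp_replace, though any regex-capable tool would do):

SELECT regexp_replace(
    'In 1409 {pers:123 Jean Gerson} preached at the council.',
    '\{pers:(\d+) ([^}]+)\}',
    '<persName ref="#pers\1">\2</persName>',
    'g'
);
-- Result: 'In 1409 <persName ref="#pers123">Jean Gerson</persName> preached at the council.'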
The workflow of data production and treatment ends with the phase of data analysis and visualization. For this purpose I adopted the R software environment, which can be connected directly to a PostgreSQL database and provides many useful libraries. For instance, a former student produced data about relationships between persons attested in medieval charters that can be used for network analysis (Figure 2).
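As a sketch of this kind of extraction (reusing the illustrative tables introduced above, not the student's actual schema), a single SQL query can produce a weighted edge list of persons co-occurring in the same knowledge unit, ready to be loaded into R:

-- Pairs of objects attested together in the same knowledge unit, with co-occurrence counts.
SELECT a.object_id AS person_1,
       b.object_id AS person_2,
       count(*)    AS weight
FROM participation a
JOIN participation b
  ON a.knowledge_unit_id = b.knowledge_unit_id
 AND a.object_id < b.object_id
GROUP BY a.object_id, b.object_id
ORDER BY weight DESC;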
The students can send the teacher a dump of their database and formulate their research questions, which the teacher transcribes into SQL, XPath or procedural language queries for extracting data, before sending these back to the students. Building on these examples, the students can then adapt the queries and scripts to new research questions. A wiki dedicated to each student's project can be created to document the specific workflow of each research project: it is not public, but it is accessible to all other students participating in the process of data production and analysis. The students can use the results of data analysis and visualization to formulate new research hypotheses, or they can integrate them into their master's thesis.
In this paper I will present the essential conceptual and technical aspects of the whole workflow and consider three major advantages of this pedagogical approach for the disciplinary domain of digital history. First, students gain experience in managing a complete workflow: from installation and personal practice of solid, community-maintained open-source software, to reflection on data modeling for their own research agenda, to collaborative data and project management through a wiki, to an introduction to data mining and visualization techniques. Secondly, the level of abstraction of the data model and of the text encoding practice proposed to the students implicitly introduces them to knowledge management and data production according to present-day standards such as CIDOC-CRM and the Text Encoding Initiative: from this perspective, historical knowledge is modeled as a graph of objects situated in time and space and linked to the texts from which they derive. Thus (and this is the third advantage) the course acquaints students with the basic principles of linked data and of semantic text encoding, introducing them to the concepts and practice of resource sharing and data curation: the datasets I use for the exercises come from the French National Library (BnF) SPARQL endpoint and DBpedia, and the texts from Wikipedia. In a final part, I will discuss the issues that this pedagogical approach raises for master's students in history.