{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preparing the participant-tracking dataset for analysis\n", "\n", "*Christian Højgaard Jensen, chj@dbi.edu*\n", "\n", "This notebook contains the scripts necessary for preparing [Eep Talstra's dataset](https://github.com/ch-jensen/Talstra-participant-tracking/blob/master/lev17to26.PredFrCSV) for analysis. The file resembles a semicolon-separated CSV file, but its fields need to be stripped of superfluous whitespace.\n", "\n", "The major part of the notebook maps the dataset to the clause atom nodes and word nodes of the [ETCBC database of the Hebrew Bible](https://github.com/ETCBC/bhsa). The mapping makes it possible to validate the quality of the dataset and to combine it with the data of the ETCBC database." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os, sys\n", "import csv, re\n", "import pandas as pd\n", "import copy\n", "import pprint as pp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Importing the dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of dataset: 4092\n" ] } ], "source": [ "filename = 'lev17to26.PredFrCSV'\n", "\n", "new_dict = {}\n", "\n", "n = 0\n", "with open(filename) as f:\n", "    next(f)  # skip the first line of the file\n", "    reader = csv.reader(f, delimiter=';')\n", "    for r in reader:\n", "        # Skip rows whose first field is 'PTC'\n", "        if r[0].strip(' ') != 'PTC':\n", "            # Strip the padding spaces from every field\n", "            ref = r[0].strip(' ')\n", "            surface_text = r[1].strip(' ')\n", "            book = r[2].strip(' ')\n", "            chapter = r[3].strip(' ')\n", "            verse = r[4].strip(' ')\n", "            line = r[5].strip(' ')\n", "            pred = r[6].strip(' ')\n", "            VPhr = r[7].strip(' ')\n", "            ptc_lex = r[8].strip(' ')\n", "            ptc_actor = r[9].strip(' ')\n", "            first_lex = r[10].strip(' ')\n", "            last_lex = r[11].strip(' ')\n", "            const_parsing = r[12].strip(' ')  # constituent parsing\n", "            n += 1\n", "\n", "            # Store the cleaned row under a running 1-based index\n", "            new_dict[n] = [ref, surface_text, book, chapter, verse, line, pred, VPhr,\n", "                           ptc_lex, ptc_actor, first_lex, last_lex, const_parsing]\n", "\n", "print(f'Length of dataset: {len(new_dict)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample of dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | ref | surface text | book | chapter | verse | line | pred | ref lex | part. set | actor | first slot | last slot | func |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | JDBR | leviticus | 17 | 1 | 1 | DBR | DBR | 3sm=JHWH | JHWH | 2 | 2 | VbPred |
| 2 | 2 | JHWH | leviticus | 17 | 1 | 1 | DBR | JHWH | 3sm=JHWH | JHWH | 3 | 3 | Subj |
| 3 | 3 | >L MCH | leviticus | 17 | 1 | 1 | DBR | >L MCH | 0sm=MCH | MCH | 4 | 5 | Compl1 |
| 4 | 4 | L->MR | leviticus | 17 | 1 | 2 | >MR | L >MR | 3sm=JHWH | JHWH | 1 | 2 | VbPred |
| 5 | 5 | DBR | leviticus | 17 | 2 | 3 | DBR | DBR | 2sm= | MCH | 1 | 1 | VbPred |
| 6 | 6 | >L >HRN W->L BNJW W->L KL BNJ JFR>L | leviticus | 17 | 2 | 3 | DBR | >L >HRN W >L BN+S W >L KL BN JFR>L | 3pm=>HRN BN+>HRN FR>L | >HRN BN >HRN | 2 | 11 | Compl1 |
| 7 | 7 | >L >HRN W->L BNJW | leviticus | 17 | 2 | 3 | DBR | >L >HRN W >L BN+S | ... | ... | 2 | 6 | -paral |
| 8 | 8 | >L >HRN | leviticus | 17 | 2 | 3 | DBR | >L >HRN | 3sm=>HRN | >HRN | 2 | 3 | -paral |
| 9 | 9 | >L BNJW | leviticus | 17 | 2 | 3 | DBR | >L BN+312 | ... | ... | 5 | 6 | -paral |
| 10 | 10 | sfx:W | leviticus | 17 | 2 | 3 | DBR | sfx | 3sm=>HRN | >HRN | 6 | 6 | -gentf |