"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.table(results, start=44, end=53, skipCols=\"1 2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can show the results more fully with `show()`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-24T07:47:06.875859Z",
"start_time": "2018-05-24T07:47:06.757345Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"line 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 5"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 6"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 7"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 8"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.show(results, skipCols=\"1 2 3\", condensed=True, condenseType=\"line\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we pick all numerical words, or rather, all words that contain a digit:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-24T07:46:55.998382Z",
"start_time": "2018-05-24T07:46:55.137956Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2.97s 11 results\n"
]
},
{
"data": {
"text/html": [
"line 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
n=7
Daghregisters
trans=Daghregisters
huurschip
trans=huurschip
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query = \"\"\"\n",
"volume n=4\n",
" page n=239\n",
" line n<9\n",
" word trans~[0-9]\n",
"\"\"\"\n",
"results = A.search(query)\n",
"A.show(results, skipCols=\"1 2 3\", condensed=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look for all places where there is a remark by the editor:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-24T07:46:55.998382Z",
"start_time": "2018-05-24T07:46:55.137956Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2.63s 2349087 results\n"
]
}
],
"source": [
"query = \"\"\"\n",
"word isremark\n",
"\"\"\"\n",
"results = A.search(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can narrow down to the page we just inspected:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-24T07:46:55.998382Z",
"start_time": "2018-05-24T07:46:55.137956Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1.72s 198 results\n"
]
}
],
"source": [
"query = \"\"\"\n",
"volume n=4\n",
" page n=239\n",
" word isremark\n",
"\"\"\"\n",
"results = A.search(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And show the results:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"line 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 5"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 6"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 7"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 8"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 9"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 10"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 11"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 12"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 13"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 14"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 15"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 16"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 17"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 18"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.show(results, condensed=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Special characters\n",
"\n",
"How can we look for special characters?\n",
"\n",
"Let's first see what special characters we have in the corpus."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"TF-app: ~/github/clariah/wp6-missieven/app"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/github/clariah/wp6-missieven/tf/1.0"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Text-Fabric: Text-Fabric API 10.2.6, clariah/wp6-missieven/app v3, Search Reference
Data: WP6-MISSIEVEN, Character table, Feature docs
Features:
\n",
"General Missives Dutch East India Company 1600-1800
\n",
" \n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
authors of the letter, surnames only\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
authors of the letter, full names\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
column number of a column in a row in a table\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
day part of the date of the letter\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word is the denominator in fraction, e.g. 4 in 1/4\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
whether a word is emphasized by typography\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
a folio reference\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word belongs to footnote text\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word is the numerator in fraction, e.g. 1 in 1/4\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word belongs to original text\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word is a numerical fraction, e.g. 1/4\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word belongs to the text of reference\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word belongs to the text of editorial remarks\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word has special typography possibly with OCR mistakes as well\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word has subscript typography possibly indicating the denominator of a fraction\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
whether a word has superscript typography possibly indicating the numerator of a fraction\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
whether a word is underlined by typography\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
footnote mark (not necessarily the same as shown on the printed page)\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
month part of the date of the letter\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
number of a volume, letter, page, para, line, table\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
number of the first page of this letter in this volume\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
place from where the letter was sent\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
punctuation and/or whitespace following a word, up to the next word\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
punctuation and/or whitespace following a word, up to the next word, footnote text only\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
punctuation and/or whitespace following a word, up to the next word, original text only\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
punctuation and/or whitespace following a word, up to the next word, remark text only\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
the date the letter was sent\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
row number of a row of column in a table\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
('sequence number of this letter among the letters of the same author in this volume',)\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
status of the letter, e.g. secret, copy\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
title of the letter\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
transcription of a word\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
transcription of a word, only for footnote text\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
transcription of a word, only for original text\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
transcription of a word, only for remark text\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
volume number\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
str
\n",
"\n",
"
the page-specific part of web links for page nodes\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
column offset of a column in a row in a table\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
int
\n",
"\n",
"
year part of the date of the letter\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
none
\n",
"\n",
"
edge between a word and the footnotes associated with it\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
none
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
" \n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A = use(\"clariah/wp6-missieven:clone\", hoist=globals())"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Special characters in text-orig-full
\n",
"\n",
"\n",
"à
\n",
"á
\n",
"â
\n",
"ä
\n",
"ç
\n",
"È
\n",
"è
\n",
"É
\n",
"é
\n",
"Ê
\n",
"ê
\n",
"Ë
\n",
"ë
\n",
"Ï
\n",
"ï
\n",
"Ó
\n",
"ó
\n",
"Ö
\n",
"ö
\n",
"Ü
\n",
"ü
\n",
"£
\n",
"§
\n",
"©
\n",
"«
\n",
"¬
\n",
"®
\n",
"°
\n",
"±
\n",
"»
\n",
"¼
\n",
"½
\n",
"æ
\n",
"ƒ
\n",
"—
\n",
"‘
\n",
"’
\n",
"“
\n",
"”
\n",
"„
\n",
"•
\n",
"…
\n",
"€
\n",
"™
\n",
"∪
\n",
"≤
\n",
"≥
\n",
"⌊
\n",
"⌋
\n",
"■
\n",
"►
\n",
"♦
\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.specialCharacters()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you click on a character, it is copied to the clipboard.\n",
"\n",
"We can search for all words with a black square:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2.53s 37 results\n"
]
}
],
"source": [
"results = A.search(\"\"\"\n",
"word trans~■\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"n | p | line | word |
\n",
"1 | 1 31:32 | | ■ |
\n",
"2 | 1 80:27 | | ■willen |
\n",
"3 | 1 557:3 | | ■ |
\n",
"4 | 2 641:42 | | ■voorsz. |
\n",
"5 | 2 660:41 | | ■ |
\n",
"6 | 3 118:14 | | ■witste |
\n",
"7 | 3 662:13 | | ■ffi. |
\n",
"8 | 4 205:4 | | ■ . . |
\n",
"9 | 4 208:43 | | ■ |
\n",
"10 | 4 209:7 | | ■ ' |
\n",
"11 | 4 758:24 | | ■ » |
\n",
"12 | 5 336:5 | | ■ |
\n",
"13 | 5 837:28 | | „■ |
\n",
"14 | 6 375:16 | | ■naar |
\n",
"15 | 7 489:38 | | ■ - |
\n",
"16 | 7 622:2 | | ■tg |
\n",
"17 | 8 66:34 | | ■ |
\n",
"18 | 8 66:41 | | ■ |
\n",
"19 | 9 88:17 | | ■ |
\n",
"20 | 9 88:23 | | ■ |
\n",
"21 | 9 94:2 | | ■ |
\n",
"22 | 9 94:5 | | ■ |
\n",
"23 | 9 94:9 | | ■ |
\n",
"24 | 9 94:15 | | ■ |
\n",
"25 | 9 94:17 | | ■ |
\n",
"26 | 9 94:22 | | ■ |
\n",
"27 | 9 94:25 | | ■ |
\n",
"28 | 9 803:16 | | ■' |
\n",
"29 | 9 803:29 | | ■^ |
\n",
"30 | 11 434:20 | | ■ / |
\n",
"31 | 12 436:7 | | ■ |
\n",
"32 | 12 436:10 | | ■ |
\n",
"33 | 12 436:11 | | ■ |
\n",
"34 | 12 436:17 | | ■ |
\n",
"35 | 12 436:21 | | ■ |
\n",
"36 | 13 42:4 | | ■ |
\n",
"37 | 13 194:45 | | ■ » |
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.table(results, condensed=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"---\n",
"\n",
"# Contents\n",
"\n",
"* **[start](start.ipynb)** start computing with this corpus\n",
"* **search** turbo charge your hand-coding with search templates\n",
"* **[compute](compute.ipynb)** sink down a level and compute it yourself\n",
"* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n",
"* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations\n",
"* **[share](share.ipynb)** draw in other people's data and let them use yours\n",
"* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)\n",
"* **[porting](porting.ipynb)** port features made against an older version to a newer version\n",
"* **[volumes](volumes.ipynb)** work with selected volumes only\n",
"\n",
"CC-BY Dirk Roorda"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}