{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "To get started: consult [start](start.ipynb)\n", "\n", "---\n", "\n", "# Search Introduction\n", "\n", "*Search* in Text-Fabric is a template-based way of looking for structural patterns in your dataset.\n", "\n", "Text-Fabric lets you combine the ease of formulating search templates for\n", "complicated syntactical patterns with the power of programmatically processing the results.\n", "\n", "This notebook will show you how to get up and running.\n", "\n", "## Easy command\n", "\n", "Running a search is as simple as this (just an example, where `template` stands for a search template):\n", "\n", "```python\n", "results = A.search(template)\n", "A.show(results)\n", "```\n", "\n", "See all ins and outs in the\n", "[search template docs](https://annotation.github.io/text-fabric/tf/about/searchusage.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Incantation\n", "\n", "The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are\n", "explained in the [start tutorial](start.ipynb)." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "TF-app: ~/text-fabric-data/github/clariah/wp6-missieven/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/clariah/wp6-missieven/tf/1.0" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 10.2.6, clariah/wp6-missieven/app v3, Search Reference
Data: WP6-MISSIEVEN, Character table, Feature docs
Features:
\n", "
General Missives Dutch East India Company 1600-1800\n", "
\n", "\n", "
\n", "
\n", "author\n", "
\n", "
str
\n", "\n", " authors of the letter, surnames only\n", "\n", "
\n", "\n", "
\n", "
\n", "authorFull\n", "
\n", "
str
\n", "\n", " authors of the letter, full names\n", "\n", "
\n", "\n", "
\n", "
\n", "col\n", "
\n", "
int
\n", "\n", " column number of a column in a row in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "day\n", "
\n", "
int
\n", "\n", " day part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "isden\n", "
\n", "
int
\n", "\n", " whether a word is the denominator in fraction, e.g. 4 in 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isemph\n", "
\n", "
str
\n", "\n", " whether a word is emphasized by typography\n", "\n", "
\n", "\n", "
\n", "
\n", "isfolio\n", "
\n", "
int
\n", "\n", " a folio reference\n", "\n", "
\n", "\n", "
\n", "
\n", "isnote\n", "
\n", "
int
\n", "\n", " whether a word belongs to footnote text\n", "\n", "
\n", "\n", "
\n", "
\n", "isnum\n", "
\n", "
int
\n", "\n", " whether a word is the numerator in fraction, e.g. 1 in 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isorig\n", "
\n", "
int
\n", "\n", " whether a word belongs to original text\n", "\n", "
\n", "\n", "
\n", "
\n", "isq\n", "
\n", "
int
\n", "\n", " whether a word is a numerical fraction, e.g. 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isref\n", "
\n", "
int
\n", "\n", " whether a word belongs to the text of reference\n", "\n", "
\n", "\n", "
\n", "
\n", "isremark\n", "
\n", "
int
\n", "\n", " whether a word belongs to the text of editorial remarks\n", "\n", "
\n", "\n", "
\n", "
\n", "isspecial\n", "
\n", "
int
\n", "\n", " whether a word has special typography possibly with OCR mistakes as well\n", "\n", "
\n", "\n", "
\n", "
\n", "issub\n", "
\n", "
int
\n", "\n", " whether a word has subscript typography possibly indicating the denominator of a fraction\n", "\n", "
\n", "\n", "
\n", "
\n", "issuper\n", "
\n", "
int
\n", "\n", " whether a word has superscript typography possibly indicating the numerator of a fraction\n", "\n", "
\n", "\n", "
\n", "
\n", "isund\n", "
\n", "
str
\n", "\n", " whether a word is underlined by typography\n", "\n", "
\n", "\n", "
\n", "
\n", "mark\n", "
\n", "
int
\n", "\n", " footnote mark (not necessarily the same as shown on the printed page\n", "\n", "
\n", "\n", "
\n", "
\n", "month\n", "
\n", "
int
\n", "\n", " month part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "n\n", "
\n", "
int
\n", "\n", " number of a volume, letter, page, para, line, table\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "page\n", "
\n", "
str
\n", "\n", " number of the first page of this letter in this volume\n", "\n", "
\n", "\n", "
\n", "
\n", "place\n", "
\n", "
str
\n", "\n", " place from where the letter was sent\n", "\n", "
\n", "\n", "
\n", "
\n", "punc\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a wordup to the next word\n", "\n", "
\n", "\n", "
\n", "
\n", "puncn\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, footnote text only\n", "\n", "
\n", "\n", "
\n", "
\n", "punco\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, original text only\n", "\n", "
\n", "\n", "
\n", "
\n", "puncr\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, remark text only\n", "\n", "
\n", "\n", "
\n", "
\n", "rawdate\n", "
\n", "
str
\n", "\n", " the date the letter was sent\n", "\n", "
\n", "\n", "
\n", "
\n", "row\n", "
\n", "
int
\n", "\n", " row number of a row of column in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "seq\n", "
\n", "
str
\n", "\n", " ('sequence number of this letter among the letters of the same author in this volume',)\n", "\n", "
\n", "\n", "
\n", "
\n", "status\n", "
\n", "
str
\n", "\n", " status of the letter, e.g. secret, copy\n", "\n", "
\n", "\n", "
\n", "
\n", "title\n", "
\n", "
str
\n", "\n", " title of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "trans\n", "
\n", "
str
\n", "\n", " transcription of a word\n", "\n", "
\n", "\n", "
\n", "
\n", "transn\n", "
\n", "
str
\n", "\n", " transcription of a word, only for footnote text\n", "\n", "
\n", "\n", "
\n", "
\n", "transo\n", "
\n", "
str
\n", "\n", " transcription of a word, only for original text\n", "\n", "
\n", "\n", "
\n", "
\n", "transr\n", "
\n", "
str
\n", "\n", " transcription of a word, only for remark text\n", "\n", "
\n", "\n", "
\n", "
\n", "vol\n", "
\n", "
int
\n", "\n", " volume number\n", "\n", "
\n", "\n", "
\n", "
\n", "weblink\n", "
\n", "
str
\n", "\n", " the page-specific part of web links for page nodes\n", "\n", "
\n", "\n", "
\n", "
\n", "x\n", "
\n", "
int
\n", "\n", " column offset of a column in a row in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "year\n", "
\n", "
int
\n", "\n", " year part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "note\n", "
\n", "
none
\n", "\n", " edge between a word and the footnotes associated with it\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"clariah/wp6-missieven\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic search command\n", "\n", "We start with the most simple form of issuing a query.\n", "Let's look for the words in volume 4, page 235, line 17\n", "\n", "All work involved in searching takes place under the hood." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:46:55.998382Z", "start_time": "2018-05-24T07:46:55.137956Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.80s 61 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npword
14 239:1IV.
24 239:1RYCKLOFF
34 239:1VAN
44 239:1GOENS,
54 239:1CORNELIS
64 239:1SPEELMAN,
74 239:1WILLEM
84 239:1VAN
94 239:2OUTHOORN,
104 239:2JOANNES
114 239:2CAMPHUYS
124 239:2EN
134 239:2JACOB
144 239:2PITS,
154 239:2BATAVIA
164 239:23
174 239:2augustus
184 239:31678.
194 239:41212,
204 239:4fol.
214 239:4699-
224 239:4706,
234 239:4copie
244 239:41220,
254 239:4fol.
264 239:496-
274 239:4109
284 239:5Met
294 239:5enige
304 239:5Engelse
314 239:5schepen
324 239:5uit
334 239:5Bantam
344 239:5verzonden. .
354 239:6
364 239:7Vgl.
374 239:7Daghregisters
384 239:71678,
394 239:7p.
404 239:7189-
414 239:7421
424 239:7het
434 239:7huurschip
444 239:7St.
454 239:7Andries
464 239:7kwam
474 239:7in
484 239:8slechte
494 239:8staat
504 239:8uit
514 239:8patria
524 239:8te
534 239:8Batavia
544 239:8en
554 239:8wordX
564 239:8op
574 239:8Onrust
584 239:8gerepareerd;
594 239:8grote
604 239:8sterfte
614 239:8op
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "volume n=4\n", " page n=239\n", " line n<9\n", " word\n", "\"\"\"\n", "results = A.search(query)\n", "A.table(results, skipCols=\"1 2 3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hyperlinks take us to the online image of this page at the Huygens institute.\n", "\n", "Note that we can choose start and/or end points in the results list." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:47:03.299872Z", "start_time": "2018-05-24T07:47:03.261873Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
nplineword
444 239:7St.
454 239:7Andries
464 239:7kwam
474 239:7in
484 239:8slechte
494 239:8staat
504 239:8uit
514 239:8patria
524 239:8te
534 239:8Batavia
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, start=44, end=53, skipCols=\"1 2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can show the results more fully with `show()`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:47:06.875859Z", "start_time": "2018-05-24T07:47:06.757345Z" } }, "outputs": [ { "data": { "text/html": [ "

line 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=1
IV.
RYCKLOFF
VAN
GOENS,
CORNELIS
SPEELMAN,
WILLEM
VAN
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=2
OUTHOORN,
JOANNES
CAMPHUYS
EN
JACOB
PITS,
BATAVIA
3
augustus
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=3
1678.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=4
folio
1212,
fol.
699-
706,
copie
1220,
fol.
96-
109
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 5" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=5
Met
enige
Engelse
schepen
uit
Bantam
verzonden. .
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 6" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=6
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 7" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=7
Vgl.
Daghregisters
1678,
p.
189-
421
het
huurschip
St.
Andries
kwam
in
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 8" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=8
slechte
staat
uit
patria
te
Batavia
en
wordX
op
Onrust
gerepareerd;
grote
sterfte
op
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, skipCols=\"1 2 3\", condensed=True, condenseType=\"line\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we pick all numerical words, or rather, words that contain a digit" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:46:55.998382Z", "start_time": "2018-05-24T07:46:55.137956Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.97s 11 results\n" ] }, { "data": { "text/html": [ "

line 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=2
OUTHOORN,
trans=OUTHOORN
JOANNES
trans=JOANNES
CAMPHUYS
trans=CAMPHUYS
EN
trans=EN
JACOB
trans=JACOB
PITS,
trans=PITS
BATAVIA
trans=BATAVIA
3
trans=3
augustus
trans=augustus
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=3
1678.
trans=1678
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=4
folio
1212,
trans=1212
fol.
trans=fol
699-
trans=699
706,
trans=706
copie
trans=copie
1220,
trans=1220
fol.
trans=fol
96-
trans=96
109
trans=109
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=7
Vgl.
trans=Vgl
Daghregisters
trans=Daghregisters
1678,
trans=1678
p.
trans=p
189-
trans=189
421
trans=421
het
trans=het
huurschip
trans=huurschip
St.
trans=St
Andries
trans=Andries
kwam
trans=kwam
in
trans=in
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "volume n=4\n", " page n=239\n", " line n<9\n", " word trans~[0-9]\n", "\"\"\"\n", "results = A.search(query)\n", "A.show(results, skipCols=\"1 2 3\", condensed=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets look for all places where there is a remark by the editor:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:46:55.998382Z", "start_time": "2018-05-24T07:46:55.137956Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.63s 2349087 results\n" ] } ], "source": [ "query = \"\"\"\n", "word isremark\n", "\"\"\"\n", "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can narrow down to the page we just inspected:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T07:46:55.998382Z", "start_time": "2018-05-24T07:46:55.137956Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.72s 198 results\n" ] } ], "source": [ "query = \"\"\"\n", "volume n=4\n", " page n=239\n", " word isremark\n", "\"\"\"\n", "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and show the results:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

line 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=7
Vgl.
isremark=1
Daghregisters
isremark=1
1678,
isremark=1
p.
isremark=1
189-
isremark=1
421
isremark=1
het
isremark=1
huurschip
isremark=1
St.
isremark=1
Andries
isremark=1
kwam
isremark=1
in
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=8
slechte
isremark=1
staat
isremark=1
uit
isremark=1
patria
isremark=1
te
isremark=1
Batavia
isremark=1
en
isremark=1
wordX
isremark=1
op
isremark=1
Onrust
isremark=1
gerepareerd;
isremark=1
grote
isremark=1
sterfte
isremark=1
op
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=9
Temate;
isremark=1
de
isremark=1
Koning
isremark=1
veroverde
isremark=1
Siau, »
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=15
De
isremark=1
contanten
isremark=1
uit
isremark=1
Europa
isremark=1
worden
isremark=1
naar
isremark=1
Coromandel
isremark=1
gezonden;
isremark=1
gewapend
isremark=1
optreden
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 5" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=16
in
isremark=1
Mataram
isremark=1
noodzakelijker
isremark=1
dan
isremark=1
tegen
isremark=1
Bantam;
isremark=1
Speelman
isremark=1
zou
isremark=1
graag
isremark=1
daarvoor
isremark=1
gebruikt
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 6" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=17
zijn,
isremark=1
maar
isremark=1
kan
isremark=1
als
isremark=1
Directeur-
isremark=1
Generaal
isremark=1
niet
isremark=1
van
isremark=1
Batavia
isremark=1
gemist
isremark=1
worden;
isremark=1
gegevens
isremark=1
over
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 7" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=18
Java,
isremark=1
vgl.
isremark=1
Opkomst
isremark=1
VII,
isremark=1
p.
isremark=1
LVII-
isremark=1
LXV;
isremark=1
het
isremark=1
werd
isremark=1
aan
isremark=1
Aru
isremark=1
Palakka
isremark=1
overgelaten
isremark=1
te
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 8" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=19
beslissen,
isremark=1
of
isremark=1
hij
isremark=1
met
isremark=1
zijn
isremark=1
troepen
isremark=1
mee
isremark=1
naar
isremark=1
Java
isremark=1
wil
isremark=1
komen;
isremark=1
er
isremark=1
is
isremark=1
te
isremark=1
Batavia
isremark=1
evenmin
isremark=1
als
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 9" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=20
bij
isremark=1
de
isremark=1
Engelsen
isremark=1
te
isremark=1
Bantam
isremark=1
peper
isremark=1
in
isremark=1
voorraad;
isremark=1
een
isremark=1
geschikte
isremark=1
commissaris
isremark=1
voor
isremark=1
Coromandel
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 10" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=21
is
isremark=1
niet
isremark=1
aanwezig »
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 11" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=28
Dergelijke
isremark=1
projecten
isremark=1
worden
isremark=1
voor
isremark=1
de
isremark=1
andere
isremark=1
vestigingen
isremark=1
opgemaakt;
isremark=1
door
isremark=1
het
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 12" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=29
sluiten
isremark=1
van
isremark=1
de
isremark=1
vrede
isremark=1
neemt
isremark=1
de
isremark=1
areca-
isremark=1
en
isremark=1
textielhandel
isremark=1
in
isremark=1
Zuid-
isremark=1
India
isremark=1
toe;
isremark=1
de
isremark=1
fortificatie
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 13" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=30
op
isremark=1
Ceylon
isremark=1
is
isremark=1
gereed;
isremark=1
verzoek
isremark=1
aan
isremark=1
Heren
isremark=1
XVII
isremark=1
daar
isremark=1
geen
isremark=1
veranderingen
isremark=1
in
isremark=1
te
isremark=1
voeren,
isremark=1
voor
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 14" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=31
de
isremark=1
Hoge
isremark=1
Regering
isremark=1
daarover
isremark=1
in
isremark=1
nov.
isremark=1
zal
isremark=1
geadviseerd
isremark=1
hebben;
isremark=1
enkele
isremark=1
galjoots
isremark=1
aangevraagd
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 15" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=32
voor
isremark=1
het
isremark=1
overbrengen
isremark=1
van
isremark=1
tussentijdse
isremark=1
adviezen
isremark=1
van
isremark=1
en
isremark=1
naar
isremark=1
patria;
isremark=1
voorlopig
isremark=1
zal
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 16" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=33
Batavia
isremark=1
bezwaard
isremark=1
blijven
isremark=1
met
isremark=1
wegens
isremark=1
overbodigheid
isremark=1
van
isremark=1
elders
isremark=1
opgeroepen
isremark=1
personeel;
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 17" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=34
overwogen,
isremark=1
of »
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line 18" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

line
n=39
Velen
isremark=1
onbekwamen
isremark=1
zijn
isremark=1
reeds
isremark=1
ontslagen
isremark=1
of
isremark=1
in
isremark=1
de
isremark=1
militie
isremark=1
opgenomen »
isremark=1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, condensed=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Special characters\n", "\n", "How can we look for special characters?\n", "\n", "Let's first see what special characters we have in the corpus." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "TF-app: ~/github/clariah/wp6-missieven/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/clariah/wp6-missieven/tf/1.0" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 10.2.6, clariah/wp6-missieven/app v3, Search Reference
Data: WP6-MISSIEVEN, Character table, Feature docs
Features:
\n", "
General Missives Dutch East India Company 1600-1800\n", "
\n", "\n", "
\n", "
\n", "author\n", "
\n", "
str
\n", "\n", " authors of the letter, surnames only\n", "\n", "
\n", "\n", "
\n", "
\n", "authorFull\n", "
\n", "
str
\n", "\n", " authors of the letter, full names\n", "\n", "
\n", "\n", "
\n", "
\n", "col\n", "
\n", "
int
\n", "\n", " column number of a column in a row in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "day\n", "
\n", "
int
\n", "\n", " day part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "isden\n", "
\n", "
int
\n", "\n", " whether a word is the denominator in fraction, e.g. 4 in 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isemph\n", "
\n", "
str
\n", "\n", " whether a word is emphasized by typography\n", "\n", "
\n", "\n", "
\n", "
\n", "isfolio\n", "
\n", "
int
\n", "\n", " a folio reference\n", "\n", "
\n", "\n", "
\n", "
\n", "isnote\n", "
\n", "
int
\n", "\n", " whether a word belongs to footnote text\n", "\n", "
\n", "\n", "
\n", "
\n", "isnum\n", "
\n", "
int
\n", "\n", " whether a word is the numerator in fraction, e.g. 1 in 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isorig\n", "
\n", "
int
\n", "\n", " whether a word belongs to original text\n", "\n", "
\n", "\n", "
\n", "
\n", "isq\n", "
\n", "
int
\n", "\n", " whether a word is a numerical fraction, e.g. 1/4\n", "\n", "
\n", "\n", "
\n", "
\n", "isref\n", "
\n", "
int
\n", "\n", " whether a word belongs to the text of reference\n", "\n", "
\n", "\n", "
\n", "
\n", "isremark\n", "
\n", "
int
\n", "\n", " whether a word belongs to the text of editorial remarks\n", "\n", "
\n", "\n", "
\n", "
\n", "isspecial\n", "
\n", "
int
\n", "\n", " whether a word has special typography possibly with OCR mistakes as well\n", "\n", "
\n", "\n", "
\n", "
\n", "issub\n", "
\n", "
int
\n", "\n", " whether a word has subscript typography possibly indicating the denominator of a fraction\n", "\n", "
\n", "\n", "
\n", "
\n", "issuper\n", "
\n", "
int
\n", "\n", " whether a word has superscript typography possibly indicating the numerator of a fraction\n", "\n", "
\n", "\n", "
\n", "
\n", "isund\n", "
\n", "
str
\n", "\n", " whether a word is underlined by typography\n", "\n", "
\n", "\n", "
\n", "
\n", "mark\n", "
\n", "
int
\n", "\n", " footnote mark (not necessarily the same as shown on the printed page\n", "\n", "
\n", "\n", "
\n", "
\n", "month\n", "
\n", "
int
\n", "\n", " month part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "n\n", "
\n", "
int
\n", "\n", " number of a volume, letter, page, para, line, table\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "page\n", "
\n", "
str
\n", "\n", " number of the first page of this letter in this volume\n", "\n", "
\n", "\n", "
\n", "
\n", "place\n", "
\n", "
str
\n", "\n", " place from where the letter was sent\n", "\n", "
\n", "\n", "
\n", "
\n", "punc\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a wordup to the next word\n", "\n", "
\n", "\n", "
\n", "
\n", "puncn\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, footnote text only\n", "\n", "
\n", "\n", "
\n", "
\n", "punco\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, original text only\n", "\n", "
\n", "\n", "
\n", "
\n", "puncr\n", "
\n", "
str
\n", "\n", " punctuation and/or whitespace following a word,up to the next word, remark text only\n", "\n", "
\n", "\n", "
\n", "
\n", "rawdate\n", "
\n", "
str
\n", "\n", " the date the letter was sent\n", "\n", "
\n", "\n", "
\n", "
\n", "row\n", "
\n", "
int
\n", "\n", " row number of a row of column in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "seq\n", "
\n", "
str
\n", "\n", " ('sequence number of this letter among the letters of the same author in this volume',)\n", "\n", "
\n", "\n", "
\n", "
\n", "status\n", "
\n", "
str
\n", "\n", " status of the letter, e.g. secret, copy\n", "\n", "
\n", "\n", "
\n", "
\n", "title\n", "
\n", "
str
\n", "\n", " title of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "trans\n", "
\n", "
str
\n", "\n", " transcription of a word\n", "\n", "
\n", "\n", "
\n", "
\n", "transn\n", "
\n", "
str
\n", "\n", " transcription of a word, only for footnote text\n", "\n", "
\n", "\n", "
\n", "
\n", "transo\n", "
\n", "
str
\n", "\n", " transcription of a word, only for original text\n", "\n", "
\n", "\n", "
\n", "
\n", "transr\n", "
\n", "
str
\n", "\n", " transcription of a word, only for remark text\n", "\n", "
\n", "\n", "
\n", "
\n", "vol\n", "
\n", "
int
\n", "\n", " volume number\n", "\n", "
\n", "\n", "
\n", "
\n", "weblink\n", "
\n", "
str
\n", "\n", " the page-specific part of web links for page nodes\n", "\n", "
\n", "\n", "
\n", "
\n", "x\n", "
\n", "
int
\n", "\n", " column offset of a column in a row in a table\n", "\n", "
\n", "\n", "
\n", "
\n", "year\n", "
\n", "
int
\n", "\n", " year part of the date of the letter\n", "\n", "
\n", "\n", "
\n", "
\n", "note\n", "
\n", "
none
\n", "\n", " edge between a word and the footnotes associated with it\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"clariah/wp6-missieven:clone\", hoist=globals())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Special characters in text-orig-full

\n", "

\n", "\n", "à\n", "á\n", "â\n", "ä\n", "ç\n", "È\n", "è\n", "É\n", "é\n", "Ê\n", "ê\n", "Ë\n", "ë\n", "Ï\n", "ï\n", "Ó\n", "ó\n", "Ö\n", "ö\n", "Ü\n", "ü\n", "£\n", "§\n", "©\n", "«\n", "¬\n", "®\n", "°\n", "±\n", "»\n", "¼\n", "½\n", "æ\n", "ƒ\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.specialCharacters()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you click on a character it is copied to the clipboard.\n", "\n", "We can search for all words with a black square:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.53s 37 results\n" ] } ], "source": [ "results = A.search(\"\"\"\n", "word trans~■\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
nplineword
11 31:32
21 80:27■willen
31 557:3
42 641:42■voorsz.
52 660:41
63 118:14■witste
73 662:13■ffi.
84 205:4■ . .
94 208:43
104 209:7■ '
114 758:24■ »
125 336:5
135 837:28„■
146 375:16■naar
157 489:38■ -
167 622:2■tg
178 66:34
188 66:41
199 88:17
209 88:23
219 94:2
229 94:5
239 94:9
249 94:15
259 94:17
269 94:22
279 94:25
289 803:16■'
299 803:29■^
3011 434:20■ /
3112 436:7
3212 436:10
3312 436:11
3412 436:17
3512 436:21
3613 42:4
3713 194:45■ »
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, condensed=True)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "# Contents\n", "\n", "* **[start](start.ipynb)** start computing with this corpus\n", "* **search** turbo charge your hand-coding with search templates\n", "* **[compute](compute.ipynb)** sink down a level and compute it yourself\n", "* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)\n", "* **[porting](porting.ipynb)** port features made against an older version to a newer version\n", "* **[volumes](volumes.ipynb)** work with selected volumes only\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }