"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A = use(\n",
" \"CLARIAH/wp6-missieven:clone\",\n",
" checkout=\"clone\",\n",
" mod=\"CLARIAH/wp6-missieven/voc-missives/migrated/tf:clone\",\n",
" hoist=globals(),\n",
" version=\"1.0\",\n",
" silent=\"verbose\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that you can click the triangle before *CLARIAH/wp6-missieven/voc-missives/migrated/tf*\n",
"to see which features that module provides.\n",
"You can then click the triangle before a feature's data type to see more information\n",
"about that feature, including the fact that it is an upgraded feature.\n",
"\n",
"```\n",
"creator: Sophie Arnoult\n",
"dateWritten: 2022-10-11T10:42:45Z\n",
"upgraded: ‼️ from version 0.8.1 to 1.0\n",
"writtenBy: Text-Fabric\n",
"```\n",
"\n",
"We are going to do a bit of research into the upgraded features."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('e_n12_2_632', 8),\n",
" ('e_n13_15_2306', 8),\n",
" ('e_n7_8_809', 8),\n",
" ('e_n13_15_1302', 7),\n",
" ('e_n7_8_1080', 7),\n",
" ('e_t10_15_108', 7),\n",
" ('e_t10_15_273', 7),\n",
" ('e_n10_11_715', 6),\n",
" ('e_n12_14_130', 6),\n",
" ('e_n12_2_383', 6),\n",
" ('e_n12_2_578', 6),\n",
" ('e_n13_15_154', 6),\n",
" ('e_n13_15_1582', 6),\n",
" ('e_n13_15_1894', 6),\n",
" ('e_n13_15_285', 6),\n",
" ('e_n5_28_103', 6),\n",
" ('e_n5_28_34', 6),\n",
" ('e_n5_28_675', 6),\n",
" ('e_n5_7_515', 6),\n",
" ('e_n8_6_710', 6))"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.entityId.freqList()[0:20]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24500"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(F.entityId.freqList())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('LOC', 12790),\n",
" ('PER', 10393),\n",
" ('LOCderiv', 4279),\n",
" ('ORG', 3841),\n",
" ('SHP', 2922),\n",
" ('GPE', 1153),\n",
" ('RELderiv', 261),\n",
" ('ORGpart', 58),\n",
" ('LOCpart', 45),\n",
" ('RELpart', 28),\n",
" ('REL', 19))"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.entityKind.freqList()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1.96s 32249 results\n"
]
}
],
"source": [
"query = \"\"\"\n",
"word entityId entityKind*\n",
"\"\"\"\n",
"results = A.search(query)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"line 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: PIETER [entityId=e_t1_49_0 entityKind=PER] DE [entityId=e_t1_49_0 entityKind=PER] CARPENTIER, [entityId=e_t1_49_0 entityKind=PER] JACOB [entityId=e_t1_49_1 entityKind=PER] DEDEL, [entityId=e_t1_49_1 entityKind=PER] CORNELIS [entityId=e_t1_49_2 entityKind=PER] REYERSZ. [entityId=e_t1_49_2 entityKind=PER]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: ABRAHAM [entityId=e_t1_49_3 entityKind=PER] VAN [entityId=e_t1_49_3 entityKind=PER] UFFELEN, [entityId=e_t1_49_3 entityKind=PER] JAKATRA [entityId=e_t1_49_4 entityKind=LOC]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Westcuste [entityId=e_t1_49_5 entityKind=LOC] van [entityId=e_t1_49_5 entityKind=LOC] Sumatra [entityId=e_t1_49_5 entityKind=LOC]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Maleis [entityId=e_n1_49_0 entityKind=LOCderiv] Atjeh. [entityId=e_n1_49_1 entityKind=LOC]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 5"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Atchinder [entityId=e_t1_49_6 entityKind=LOCderiv]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 6"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Atchin [entityId=e_t1_49_7 entityKind=LOC]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 7"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Casembroot [entityId=e_t1_49_8 entityKind=PER]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 8"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Engels [entityId=e_t1_49_9 entityKind=LOCderiv] d’Unite [entityId=e_t1_49_10 entityKind=SHP]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 9"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: France [entityId=e_t1_49_11 entityKind=LOCderiv] Montmorenci [entityId=e_t1_49_12 entityKind=SHP] De [entityId=e_n1_49_2 entityKind=PER] Montmorency [entityId=e_n1_49_2 entityKind=PER] de [entityId=e_n1_49_3 entityKind=PER] Beaulieu ( [entityId=e_n1_49_3 entityKind=PER]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 10"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line: Atjeh, [entityId=e_n1_49_4 entityKind=LOC] Maleise [entityId=e_n1_49_5 entityKind=LOCderiv]"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.show(results, condensed=True, end=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's view the distribution of named entities over the volumes.\n",
"\n",
"We run a query that finds, within each volume, the words that have an `entityId`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2.46s 32249 results\n"
]
}
],
"source": [
"query = \"\"\"\n",
"volume\n",
" word entityId\n",
"\"\"\"\n",
"results = A.search(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we process the results. Each result is a tuple consisting of a volume node and a word node;\n",
"we count the results per volume number."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({1: 1451,\n",
" 2: 1150,\n",
" 3: 1536,\n",
" 4: 1758,\n",
" 5: 3032,\n",
" 6: 2864,\n",
" 7: 2343,\n",
" 8: 1685,\n",
" 9: 1870,\n",
" 10: 1933,\n",
" 11: 4695,\n",
" 12: 1909,\n",
" 13: 6023})"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import collections\n",
"\n",
"# count the entity-bearing words per volume number\n",
"eDist = collections.Counter()\n",
"\n",
"for (vol, word) in results:\n",
"    eDist[F.n.v(vol)] += 1\n",
"\n",
"eDist"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counter has no entry for volume 14: the entities were made against version 0.8.1, which did not have a volume 14.\n",
"\n",
"So it is preferable that the third party repeats the entity recognition\n",
"on the new version of the corpus, so that the entities in volume 14 get recognized too.\n",
"\n",
"This has in fact happened: Sophie Arnoult has run the machinery again.\n",
"\n",
"Let's quickly load that version and compute the distribution of entities there."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"TF-app: ~/github/CLARIAH/wp6-missieven/app"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/github/CLARIAH/wp6-missieven/tf/1.0"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/github/CLARIAH/wp6-missieven/voc-missives/export/tf/1.0"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is Text-Fabric 10.2.6\n",
"Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n",
"\n",
"47 features found and 2 ignored\n",
" 5.66s All features loaded/computed - for details use TF.isLoaded()\n",
" 0.54s All additional features loaded - for details use TF.isLoaded()\n"
]
},
{
"data": {
"text/html": [
"<b>Text-Fabric:</b> Text-Fabric API 10.2.6, CLARIAH/wp6-missieven/app v3, Search Reference<br>\n",
"<b>Data:</b> WP6-MISSIEVEN, Character table, Feature docs<br>\n",
"<b>Features:</b><br>\n",
"<details><summary><b>CLARIAH/wp6-missieven/voc-missives/export/tf</b></summary>\n",
"<ul>\n",
"<li>str: identifier of a named entity</li>\n",
"<li>str: kind of a named entity</li>\n",
"</ul></details>\n",
"<details><summary><b>General Missives Dutch East India Company 1600-1800</b></summary>\n",
"<ul>\n",
"<li>str: authors of the letter, surnames only</li>\n",
"<li>str: authors of the letter, full names</li>\n",
"<li>int: column number of a column in a row in a table</li>\n",
"<li>int: day part of the date of the letter</li>\n",
"<li>int: whether a word is the denominator in fraction, e.g. 4 in 1/4</li>\n",
"<li>str: whether a word is emphasized by typography</li>\n",
"<li>int: a folio reference</li>\n",
"<li>int: whether a word belongs to footnote text</li>\n",
"<li>int: whether a word is the numerator in fraction, e.g. 1 in 1/4</li>\n",
"<li>int: whether a word belongs to original text</li>\n",
"<li>int: whether a word is a numerical fraction, e.g. 1/4</li>\n",
"<li>int: whether a word belongs to the text of reference</li>\n",
"<li>int: whether a word belongs to the text of editorial remarks</li>\n",
"<li>int: whether a word has special typography possibly with OCR mistakes as well</li>\n",
"<li>int: whether a word has subscript typography possibly indicating the denominator of a fraction</li>\n",
"<li>int: whether a word has superscript typography possibly indicating the numerator of a fraction</li>\n",
"<li>str: whether a word is underlined by typography</li>\n",
"<li>int: footnote mark (not necessarily the same as shown on the printed page</li>\n",
"<li>int: month part of the date of the letter</li>\n",
"<li>int: number of a volume, letter, page, para, line, table</li>\n",
"<li>str</li>\n",
"<li>str: number of the first page of this letter in this volume</li>\n",
"<li>str: place from where the letter was sent</li>\n",
"<li>str: punctuation and/or whitespace following a wordup to the next word</li>\n",
"<li>str: punctuation and/or whitespace following a word,up to the next word, footnote text only</li>\n",
"<li>str: punctuation and/or whitespace following a word,up to the next word, original text only</li>\n",
"<li>str: punctuation and/or whitespace following a word,up to the next word, remark text only</li>\n",
"<li>str: the date the letter was sent</li>\n",
"<li>int: row number of a row of column in a table</li>\n",
"<li>str: ('sequence number of this letter among the letters of the same author in this volume',)</li>\n",
"<li>str: status of the letter, e.g. secret, copy</li>\n",
"<li>str: title of the letter</li>\n",
"<li>str: transcription of a word</li>\n",
"<li>str: transcription of a word, only for footnote text</li>\n",
"<li>str: transcription of a word, only for original text</li>\n",
"<li>str: transcription of a word, only for remark text</li>\n",
"<li>int: volume number</li>\n",
"<li>str: the page-specific part of web links for page nodes</li>\n",
"<li>int: column offset of a column in a row in a table</li>\n",
"<li>int: year part of the date of the letter</li>\n",
"<li>none: edge between a word and the footnotes associated with it</li>\n",
"<li>none</li>\n",
"</ul></details>\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A = use(\n",
" \"CLARIAH/wp6-missieven:clone\",\n",
" checkout=\"clone\",\n",
" mod=\"CLARIAH/wp6-missieven/voc-missives/export/tf:clone\",\n",
" hoist=globals(),\n",
" version=\"1.0\",\n",
" silent=\"verbose\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2.46s 29159 results\n"
]
},
{
"data": {
"text/plain": [
"Counter({1: 2208,\n",
" 2: 1799,\n",
" 3: 2297,\n",
" 4: 1987,\n",
" 5: 1858,\n",
" 6: 4295,\n",
" 7: 3068,\n",
" 8: 1251,\n",
" 9: 2825,\n",
" 10: 1745,\n",
" 11: 2455,\n",
" 12: 1350,\n",
" 13: 977,\n",
" 14: 1044})"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# rerun the volume/word query from above, now against the re-run entities\n",
"results = A.search(query)\n",
"\n",
"eDist = collections.Counter()\n",
"\n",
"for (vol, word) in results:\n",
"    eDist[F.n.v(vol)] += 1\n",
"\n",
"eDist"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly, more has changed between version 0.8.1 and version 1.0 than the addition of volume 14:\n",
"the counts differ in every volume, and the total drops from 32249 to 29159.\n",
"So at this point in time the migrated features (from 0.8.1 to 1.0) are practically obsolete."
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"---\n",
"\n",
"# Contents\n",
"\n",
"* **[start](start.ipynb)** start computing with this corpus\n",
"* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n",
"* **[compute](compute.ipynb)** sink down a level and compute it yourself\n",
"* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n",
"* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations\n",
"* **[share](share.ipynb)** draw in other people's data and let them use yours\n",
"* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)\n",
"* **porting** port features made against an older version to a newer version\n",
"* **[volumes](volumes.ipynb)** work with selected volumes only\n",
"\n",
"CC-BY Dirk Roorda"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}