"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A = use(\n",
" \"CLARIAH/wp6-missieven\",\n",
" hoist=globals(),\n",
" mod=(\n",
" \"CLARIAH/wp6-missieven/exercises/entities/tf\",\n",
" \"CLARIAH/wp6-missieven/exercises/numerics/tf\",\n",
" ),\n",
" version=VERSION,\n",
" silent=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above you see new sections in the feature list that you can expand to see\n",
"which features each module contributed.\n",
"\n",
"Now, suppose we did not know much about these features; then we would like to do a few basic checks.\n",
"\n",
"A good start is to inspect a frequency list of the values of the new features,\n",
"and then to perform a query looking for the nodes that have these features.\n",
"\n",
"We do that for the entity features and for the number feature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Entities"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('T11', 6),\n",
" ('T2', 5),\n",
" ('T13', 3),\n",
" ('T16', 3),\n",
" ('T8', 3),\n",
" ('T9', 3),\n",
" ('T10', 2),\n",
" ('T15', 2),\n",
" ('T17', 2),\n",
" ('T3', 2),\n",
" ('T5', 2),\n",
" ('T1', 1),\n",
" ('T12', 1),\n",
" ('T4', 1),\n",
" ('T6', 1),\n",
" ('T7', 1))"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.entityId.freqList()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('Person', 18), ('GPE', 15), ('Organization', 5))"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.entityKind.freqList()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('Ternate', 5), ('Amboina', 2))"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.entityComment.freqList()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's query all words that have an entity annotation:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 4.40s 23 results\n"
]
}
],
"source": [
"query = \"\"\"\n",
"word entityId entityKind* entityComment*\n",
"\"\"\"\n",
"results = A.search(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we query all words where the `entityId` feature is present.\n",
"We also mention the `entityKind` and `entityComment` features, but with a `*` behind them.\n",
"That is a criterion that is always True, so these mentions do not alter the result list.\n",
"But now these features do occur in the query, and when we show results, these features will be shown."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"line 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T11entityKind=Person
orancay
entityId=T11entityKind=Person
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T11entityKind=Person
bij
entityId=T11entityKind=Person
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T13entityKind=Person
arancay
entityId=T12entityKind=Person
Nera,
entityId=T1entityKind=Person
sabandaer
entityId=T13entityKind=Person
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T11entityKind=Person
orancaye
entityId=T11entityKind=Person
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 5"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T15entityKind=GPE
Hollanders
entityId=T15entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 6"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T16entityKind=Person
Verhoeven
entityId=T16entityKind=Person
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 7"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T17entityKind=GPE
Ambojna
entityId=T17entityKind=GPE
Ternnate
entityId=T2entityKind=GPE
Coninck
entityId=T3entityKind=Person
van
entityId=T3entityKind=Person
Spagnien
entityId=T4entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 8"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T2entityKind=GPE
Heeren
entityId=T5entityKind=Organization
Staeten
entityId=T5entityKind=Organization
Coninck
entityId=T6entityKind=Person
ditto
entityId=T2entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 9"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T2entityKind=GPE
plaetse
entityId=T2entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 10"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T8entityKind=Organization
Coninck
entityId=T7entityKind=Person
Heeren
entityId=T8entityKind=Organization
Staeten,
entityId=T8entityKind=Organization
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 11"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T9entityKind=GPE
Banda
entityId=T9entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line 12"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
entityId=T10entityKind=GPE
Engelsche
entityId=T10entityKind=GPE
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.show(results, condensed=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Observation**\n",
"\n",
"Not only words have entity features; the lines themselves have also received such annotations.\n",
"\n",
"It turns out that it is not very useful to annotate *lines* with entities this way.\n",
"It would be better to annotate them with the number of entities they contain.\n",
"That is our feedback to the creator of these annotations, and because we know the GitHub repo that they are from,\n",
"we can file an [issue](https://github.com/annotation/tutorials/issues/3)!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numerics"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((121, 6),\n",
" (361, 5),\n",
" (421, 4),\n",
" (101, 3),\n",
" (141, 3),\n",
" (161, 3),\n",
" (185, 3),\n",
" (261, 3),\n",
" (131, 2),\n",
" (151, 2),\n",
" (240, 2),\n",
" (360, 2),\n",
" (621, 2),\n",
" (1241, 2),\n",
" (1441, 2),\n",
" (111, 1),\n",
" (171, 1),\n",
" (181, 1),\n",
" (191, 1),\n",
" (201, 1),\n",
" (241, 1),\n",
" (250, 1),\n",
" (281, 1),\n",
" (291, 1),\n",
" (331, 1),\n",
" (371, 1),\n",
" (480, 1),\n",
" (501, 1),\n",
" (541, 1),\n",
" (561, 1),\n",
" (631, 1),\n",
" (660, 1),\n",
" (670, 1),\n",
" (701, 1),\n",
" (721, 1),\n",
" (731, 1),\n",
" (761, 1),\n",
" (814, 1),\n",
" (901, 1),\n",
" (911, 1),\n",
" (1101, 1),\n",
" (1151, 1),\n",
" (1321, 1),\n",
" (1371, 1),\n",
" (1661, 1),\n",
" (1741, 1),\n",
" (2000, 1),\n",
" (2501, 1),\n",
" (2921, 1),\n",
" (2981, 1),\n",
" (2991, 1),\n",
" (3191, 1),\n",
" (3231, 1),\n",
" (3501, 1),\n",
" (4021, 1),\n",
" (4041, 1),\n",
" (4061, 1),\n",
" (5021, 1),\n",
" (6121, 1),\n",
" (8191, 1),\n",
" (8421, 1),\n",
" (8541, 1),\n",
" (9581, 1),\n",
" (9961, 1),\n",
" (10791, 1),\n",
" (10921, 1),\n",
" (11911, 1),\n",
" (15301, 1),\n",
" (15361, 1),\n",
" (20621, 1),\n",
" (32091, 1),\n",
" (49141, 1),\n",
" (52141, 1),\n",
" (65931, 1),\n",
" (72771, 1),\n",
" (75621, 1),\n",
" (77841, 1),\n",
" (94001, 1),\n",
" (98771, 1),\n",
" (128181, 1),\n",
" (167081, 1),\n",
" (925981, 1),\n",
" (977381, 1),\n",
" (1213781, 1),\n",
" (2228491, 1))"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F.number.freqList()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see the values that we have generated before."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's show the original and the number side by side."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1.87s 114 results\n"
]
}
],
"source": [
"results = A.search(\n",
" \"\"\"\n",
"word number transo*\n",
"\"\"\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"result 1"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
115J,
number=1151transo=115J
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 2"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
loopende),
transo=loopende
deminueert
transo=deminueert
d’advance
transo=d’advance
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 3"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
9400J
number=94001transo=9400J
waerschouwingh
transo=waerschouwingh
toecomende
transo=toecomende
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 4"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
coopmanschappen,
transo=coopmanschappen
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 5"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
row
cell
Eerstelijck:
transo=Eerstelijck
cell
cargasoen,
transo=cargasoen
bestaende
transo=bestaende
lijnwaten
transo=lijnwaten
diamanten,
transo=diamanten
gescheept,
transo=gescheept
Masilipatnam
transo=Masilipatnam
aengelandt ;
transo=aengelandt
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 6"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
row
cell
2062J
number=20621transo=2062J
cell
capitael,
transo=capitael
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 7"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
vercocht,
transo=vercocht
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 8"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
d’affgedrongen
transo=d’affgedrongen
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 9"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
12818J,
number=128181transo=12818J
overgaen.
transo=overgaen
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"result 10"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"line
Masilipatnam
transo=Masilipatnam
124J
number=1241transo=124J
besarsteen,
transo=besarsteen
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A.show(results, start=1, end=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# All together!\n",
"\n",
"If more researchers have shared data modules, you can draw them all in.\n",
"\n",
"Then you can design queries that use features from all these different sources.\n",
"\n",
"In that way, you build your own research on top of the work of others."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hover over the features to see where they come from, and you'll see they come from your local GitHub repo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# For real\n",
"\n",
"See the [next tutorial in this series](entities.ipynb) for how you can\n",
"draw in and make use of additional features produced by a serious algorithm to detect\n",
"named entities."
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"---\n",
"\n",
"# Contents\n",
"\n",
"* **[start](start.ipynb)** start computing with this corpus\n",
"* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n",
"* **[compute](compute.ipynb)** sink down a level and compute it yourself\n",
"* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n",
"* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations\n",
"* **share** draw in other people's data and let them use yours\n",
"* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)\n",
"* **[porting](porting.ipynb)** port features made against an older version to a newer version\n",
"* **[volumes](volumes.ipynb)** work with selected volumes only\n",
"\n",
"CC-BY Dirk Roorda"
]
}
],
"metadata": {
"jupytext": {
"encoding": "# -*- coding: utf-8 -*-"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}