{ "cells": [ { "cell_type": "markdown", "id": "25f9f4c7-036e-4f9b-b825-a4529df77819", "metadata": {}, "source": [ "# Text-Fabric and Alpino\n", "\n", "## Introduction\n", "\n", "Let's compare two tools that give users computational power over annotated corpora:\n", "\n", "[Text-Fabric](https://github.com/annotation/text-fabric)\n", "and\n", "[Alpino](https://urd2.let.rug.nl/~kleiweg/alpinograph-docs/), especially its graph-based\n", "query language.\n", "\n", "### How to digest an annotated corpus\n", "\n", "If you have an annotated corpus there are three ways you could *consume* the data:\n", "\n", "* browsing and reading,\n", "* computing (walk the corpus by means of a program that collects results),\n", "* querying.\n", "\n", "The advantage of computing and querying over browsing and reading is that you can find needles in haystacks,\n", "and filter and process the corpus in ways that are infeasible by the human eye and hand.\n", "\n", "Yet, when computing and querying are done, it is vitally important that users can read and browse around\n", "the results, in order to see what is happening in the corpus and get ideas for new computations and queries.\n", "\n", "The advantage of querying is that it can be done without programming, although the query language\n", "must be mastered. But that is still much easier and less time consuming than the art of programming.\n", "\n", "The disadvantage of querying is that sooner or later, when the research questions become increasingly complex,\n", "the query language tends to become a straightjacket. Users have to become over-ingenious in order to find\n", "the queries that work for them. 
They approach a point where it pays off to compute.\n", "\n", "In an ideal world, it should be easy to move smoothly between querying and hand-coding.\n", "\n", "In the real world, this gap tends to be an enormous barrier.\n", "\n", "Results from the query engine are serialized from an internal representation to an external representation.\n", "In the worst case, the user only has access to the results by means of a web interface.\n", "In better cases the user can download results as data, in JSON, TSV, or TXT.\n", "\n", "Even then, important parts of the context tend to get lost. Where exactly are the results located in the corpus?\n", "Can I get from a result sentence to the sentence that immediately follows it?\n", "\n", "This gap becomes surmountable if the system has an addressing scheme for every possible text fragment\n", "in the corpus, and is able to deliver those addresses with the results.\n", "\n", "There are also design characteristics of the query language that help to lessen the gap.\n", "\n", "*First of all*, a query should expose the terms of the corpus clearly and unaltered, and should be minimalistic\n", "in all other respects.\n", "\n", "*Secondly*, a query should mimic the pattern it is designed to retrieve; this helps when you want to search by example.\n", "\n", "*Thirdly*, a query should be able to express spatial relationships between text fragments, such as \"contained-in\",\n", "\"overlapping\", \"completely before\", \"adjacent\", etc.\n", "\n", "*Fourthly*, a query should be able to combine spatial relationships with all other features that the corpus has to offer." ] }, { "cell_type": "markdown", "id": "65642655-9626-4df2-b930-558fc93512bb", "metadata": { "tags": [] }, "source": [ "## Text-Fabric\n", "\n", "In Text-Fabric we aim to produce such a query language. 
Here are a few characteristics:\n", "\n", "* Queries are `topographical`: a query is a relationship pattern, and the results are all instantiations of that\n", "  pattern found in the corpus.\n", " \n", " \n", "* The query language is data-agnostic, but it can use the names of all features defined in the corpus.\n", "\n", "* The results of queries are tuples of nodes, and nodes are integers." ] }, { "cell_type": "markdown", "id": "f5bce31c-f21d-454f-9c62-f16e9b821965", "metadata": {}, "source": [ "### Example\n", "\n", "Suppose the corpus\n", "\n", "* has nodes of the types `sentence`, `clause`, `phrase`, `word`,\n", "* defines a feature `typ` for clauses and phrases,\n", "* defines features `sp` (part-of-speech) and `g_cons` (consonantal transcription) for words.\n", "\n", "Then we can make a query that looks for NP phrases that contain a verb, where such phrases occur\n", "in clauses of type `Ptcp`. Oh, and the verb should begin with an `M`.\n", "\n", "**N.B.: This is a real-world example. You can reproduce this on your own computer.**" ] }, { "cell_type": "markdown", "id": "69f88bd1-5e2f-4661-b389-684d569f2bcc", "metadata": {}, "source": [ "```\n", "sentence\n", "  clause typ=Ptcp\n", "    phrase typ=NP\n", "      word sp=verb g_cons~^M\n", "```" ] }, { "cell_type": "markdown", "id": "2b0eee04-f981-4643-878f-968cdbd0e9cd", "metadata": {}, "source": [ "The query results form a tuple of individual results, where each individual result is a tuple\n", "\n", "`(s, c, p, w)`\n", "\n", "of a sentence node, a clause node, a phrase node and a word node." ] }, { "cell_type": "markdown", "id": "b6f2acac-7f09-452f-aa3e-103356484234", "metadata": {}, "source": [ "## Running a query\n", "\n", "In Text-Fabric, a query can be run from within a Python program.\n", "\n", "Load the software." 
] }, { "cell_type": "code", "execution_count": 2, "id": "b7269223-c528-438c-8d7b-7d928751bf8d", "metadata": {}, "outputs": [], "source": [ "# assumed: pip install 'text-fabric[all]'\n", "\n", "from tf.app import use" ] }, { "cell_type": "markdown", "id": "386ac899-dcb2-4d7f-8951-eb6f4bd434ee", "metadata": {}, "source": [ "Load the data and give a handle to the API that gives access to it:" ] }, { "cell_type": "code", "execution_count": 3, "id": "a2da1771-8712-4fc1-92e4-0c70f57268f1", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/etcbc/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.1.3, ETCBC/bhsa/app v3, Search Reference
\n", " Data: BHSA, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots/node | % coverage
book | 39 | 10938.21 | 100
chapter | 929 | 459.19 | 100
lex | 9230 | 46.22 | 100
verse | 23213 | 18.38 | 100
half_verse | 45179 | 9.44 | 100
sentence | 63717 | 6.70 | 100
sentence_atom | 64514 | 6.61 | 100
clause | 88131 | 4.84 | 100
clause_atom | 90704 | 4.70 | 100
phrase | 253203 | 1.68 | 100
phrase_atom | 267532 | 1.59 | 100
subphrase | 113850 | 1.42 | 38
word | 426590 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on freqnuecy\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\") # the corpus data is retrieved from github.com/ETCBC/bhsa and then cached locally" ] }, { "cell_type": "markdown", "id": "7a6477a5-8a5e-45ff-aab7-41c44022f5e5", "metadata": {}, "source": [ "Write a query:" ] }, { "cell_type": "code", "execution_count": 4, "id": "f53f1fb2-3add-4d02-b2c0-e97b574c90ee", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "sentence\n", " clause typ=Ptcp\n", " phrase typ=NP\n", " word sp=verb g_cons~^M\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "4549981e-1dc5-4923-aa59-528cb30cc8d5", "metadata": {}, "source": [ "Run the query:" ] }, { "cell_type": "code", "execution_count": 5, "id": "04d981da-8152-4e8e-b797-4b24d68fc2a6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.38s 20 results\n" ] } ], "source": [ "results = A.search(query)" ] }, { "cell_type": "markdown", "id": "94b4f260-f23f-4d83-b07b-0311fb6f8919", "metadata": {}, "source": [ "Show the results as nodes:" ] }, { "cell_type": "code", "execution_count": 6, "id": "be720366-e647-432c-b382-a69f463daa6e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1174454, 430355, 660050, 14127),\n", " (1177926, 434807, 673387, 35121),\n", " (1187343, 448669, 715553, 112534),\n", " (1187717, 449226, 717232, 115558),\n", " (1187736, 449248, 717296, 115657),\n", " (1202000, 468385, 774745, 213096),\n", " (1202370, 468879, 776134, 215412),\n", " (1210183, 479652, 805689, 262762),\n", " (1217009, 488825, 831501, 304239),\n", " (1217856, 489988, 834658, 309591),\n", " (1217882, 490016, 834736, 309691),\n", " (1217899, 490036, 834790, 309767),\n", " (1226243, 501363, 863729, 350269),\n", " 
(1226332, 501471, 864016, 350676),\n", " (1226471, 501648, 864488, 351373),\n", " (1226625, 501858, 865033, 352145),\n", " (1227200, 502606, 866980, 355006),\n", " (1227224, 502635, 867058, 355103),\n", " (1229566, 506177, 876769, 370968),\n", " (1231842, 509729, 887037, 390501)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results" ] }, { "cell_type": "markdown", "id": "f502c478-f484-4a07-8633-90ac05a38012", "metadata": {}, "source": [ "Dress-up the results and show the first two of them in a table:" ] }, { "cell_type": "code", "execution_count": 7, "id": "9432c509-4f7a-4acc-adfa-34aab9b613aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
npsentenceclausephraseword
1Genesis 27:29 וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ מְבָרֲכֶ֖יךָ מְבָרֲכֶ֖יךָ
2Exodus 12:19 כִּ֣י׀ כָּל־ אֹכֵ֣ל מַחְמֶ֗צֶת וְ נִכְרְתָ֞ה הַנֶּ֤פֶשׁ הַהִוא֙ מֵעֲדַ֣ת יִשְׂרָאֵ֔ל בַּגֵּ֖ר וּבְאֶזְרַ֥ח הָאָֽרֶץ׃ אֹכֵ֣ל מַחְמֶ֗צֶת מַחְמֶ֗צֶת מַחְמֶ֗צֶת
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, end=2)" ] }, { "cell_type": "markdown", "id": "9da53ef6-15c9-4fc7-b2ed-7b32524d835c", "metadata": {}, "source": [ "Change to the ASCII transliteration of the consonants of each word, and show all results in a table, but hide the\n", "last three columns." ] }, { "cell_type": "code", "execution_count": 8, "id": "bf4ae7fd-46ec-46fa-99c4-21510d1de44f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npsentenceclausephraseword
1Genesis 27:29 W MBRKJK BRWK00
2Exodus 12:19 KJ05 KL& >KL MXMYT W NKRTH HNPC HHW> M<DT JFR>L BGR WB>ZRX H>RY00
3Deuteronomy 33:20 BRWK MRXJB GD
4Joshua 6:9 W HM>SP HLK >XRJ H>RWN
5Joshua 6:13 W HM>SP HLK >XRJ >RWN JHWH
6Isaiah 3:12 <MJ M>CRJK MT<JM W DRK >RXTJK BL<W00_S
7Isaiah 9:15 W M>CRJW MBL<JM00
8Jeremiah 51:1 HNNJ M<JR <L&BBL W>L&JCBJ LB QMJ RWX MCXJT00
9Haggai 1:6 W HMFTKR MFTKR >L&YRWR NQWB00_P
10Malachi 1:7 MGJCJM <L&MZBXJ LXM MG>L
11Malachi 1:11 W BKL&MQWM MQVR MGC LCMJ WMNXH VHWRH
12Malachi 1:14 W ZBX MCXT L>DNJ
13Proverbs 13:12 TWXLT MMCKH MXLH& LB
14Proverbs 14:31 W MKBDW XNN >BJWN00
15Proverbs 17:4 MR< MQCJB <L&FPT&>WN
16Proverbs 20:2 MT<BRW XWV> NPCW00
17Proverbs 29:15 W N<R MCLX MBJC >MW00
18Proverbs 29:26 RBJM MBQCJM PNJ&MWCL
19Daniel 2:22 HW> GL> <MJQT> WMSTRT>
20Nehemiah 12:47 W KL&JFR>L BJMJ ZRBBL WBJMJ NXMJH NTNJM MNJWT HMCRRJM WHC<RJM DBR&JWM BJWMW
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, fmt=\"text-trans-plain\", skipCols={2, 3, 4})" ] }, { "cell_type": "markdown", "id": "775692eb-9eb1-432f-9878-cfc15a688a10", "metadata": {}, "source": [ "Make the `text-trans-plain` format the default and show results by sentence until further notice." ] }, { "cell_type": "code", "execution_count": 9, "id": "7d46de2d-1c48-4728-94be-41028baed688", "metadata": {}, "outputs": [], "source": [ "A.displaySetup(fmt=\"text-trans-plain\", condenseType=\"sentence\")" ] }, { "cell_type": "markdown", "id": "77003d76-bd3b-4efa-ad93-979c0582eb0f", "metadata": {}, "source": [ "Expand result 11, because something is going on there: a phrase gets interrupted!" ] }, { "cell_type": "code", "execution_count": 10, "id": "7d5f7b24-22e7-40b8-92cc-eac0bd012c77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

result 11" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

sentence 58
clause Ptcp NA
typ=Ptcp
phrase CP Conj
typ=CP
g_cons=Wsp=conj
phrase PP Loca
typ=PP
g_cons=Bsp=prep
g_cons=KLsp=subs
g_cons=MQWMsp=subs
phrase NP Subj
typ=NP
g_cons=MQVRsp=verb
phrase VP PreC
typ=VP
g_cons=MGCsp=verb
phrase PP Adju
typ=PP
g_cons=Lsp=prep
g_cons=CMJsp=subs
phrase NP Subj
typ=NP
g_cons=Wsp=conj
g_cons=MNXHsp=subs
g_cons=VHWRHsp=adjv
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, start=11, end=11)" ] }, { "cell_type": "markdown", "id": "58c0d4d1-56b7-41c5-b6db-db1204a9433a", "metadata": {}, "source": [ "**N.B.: None of the words `sentence`, `clause`, `phrase`, `word`, `typ`, `sp`, `g_cons`\n", "`Ptcp`, `NP`, is built into Text-Fabric, they are taken from the corpus organisation.**\n", "\n", "So the text of the query is almost entirely made up of terms that are familiar if you know the corpus.\n", "\n", "In order to get to know the corpus, the user needs to consult a\n", "*feature documentation document*. For the Hebrew Bible that looks like \n", "[this](https://etcbc.github.io/bhsa/features/0_home/).\n", "\n", "There is much more to search in Text-Fabric.\n", "\n", "Here are the\n", "[search docs](https://annotation.github.io/text-fabric/tf/about/searchusage.html#what-is-text-fabric-search)\n", "and here is a search\n", "[tutorial for the Hebrew Bible](https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/search.ipynb)." 
] }, { "cell_type": "markdown", "id": "63283a89-2164-448c-9596-1a98b224877e", "metadata": {}, "source": [ "## Computing with the results\n", "\n", "The fact that the results are just tuples of integers makes it easy post-process results with\n", "your own code.\n", "\n", "Suppose you want to limit the results to those sentences that do not have *hapaxes*,\n", "then you can write your own Python code to do that.\n", "\n", "Suppose the corpus has a word feature `freq_lex` that for each word occurrence has the number\n", "of occurrences of the lexeme of the word in the corpus.\n", "\n", "Then we can filter like this:" ] }, { "cell_type": "code", "execution_count": 11, "id": "05ae4d95-d4f0-4a90-ad60-a5ba4f8c78e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len(wantedResults)=19 len(unwantedResults)=1\n" ] } ], "source": [ "F = A.api.Feature # the API to retrieve feature values\n", "L = A.api.Locality # the API to navigate to nodes in the neighbourhood\n", "\n", "unwantedResults = []\n", "wantedResults = []\n", "\n", "for result in results:\n", " s = result[0]\n", " words = L.d(s, otype=\"word\")\n", " hasHapax = any(F.freq_lex.v(w) == 1 for w in words)\n", " if hasHapax:\n", " unwantedResults.append(result)\n", " else:\n", " wantedResults.append(result)\n", " \n", "print(f\"{len(wantedResults)=} {len(unwantedResults)=}\")" ] }, { "cell_type": "markdown", "id": "017ba272-576d-45ef-bd31-79119a6e45ac", "metadata": {}, "source": [ "Let's show the unwanted result and show the `freq_lex` feature for all words:" ] }, { "cell_type": "code", "execution_count": 12, "id": "bf29442c-02d0-4c35-9c8a-b1f5288b75ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

sentence 68
clause Ptcp NA
typ=Ptcp
phrase PPrP Subj
typ=PPrP
freq_lex=15g_cons=HW>sp=prps
phrase VP PreC
typ=VP
freq_lex=9g_cons=GL>sp=verb
phrase NP Objc
typ=NP
freq_lex=1g_cons=<MJQT>sp=adjv
freq_lex=731g_cons=Wsp=conj
freq_lex=1g_cons=MSTRT>sp=verb
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(unwantedResults, extraFeatures=\"freq_lex\")" ] }, { "cell_type": "markdown", "id": "856ffc7a-d561-4bf3-b0f8-ca8484b7aab8", "metadata": {}, "source": [ "This ends the Text-Fabric demo.\n", "\n", "# Comparison with Alpino\n", "\n", "The Alpino system has a graph-based\n", "[query language](https://urd2.let.rug.nl/~kleiweg/alpinograph-docs/zoeken/).\n", "\n", "Let's make an explorative comparison between this Alpino way of searching and Text-Fabric, because there seems\n", "to be quite a bit of convergence between the two.\n", "\n", "## Data model\n", "\n", "Alpino works with `nodes` and `words`.\n", "\n", "Text-Fabric works with `nodes`. Some nodes are the atomic ones, called `slots`, they are the textual positions.\n", "What you find at a slot depends on what the corpus modeller has chosen, but quite often slots correspond to words.\n", "\n", "Alpino nodes have categories (`cat`), e.g. `NP`, `PP`, `SMAIN`.\n", "\n", "Text-Fabric nodes have a type (`otype` = object type). Above we saw node types `sentence`, `clause`, `phrase`, `word`,\n", "but this is the choice of the corpus modeller. Text-Fabric expects in every corpus a feature file `otype` that\n", "maps all nodes to types.\n", "\n", "For example, the BHSA above has this `otype.tf`:\n", "\n", "```\n", "1-426590\tword\n", "426591-426629\tbook\n", "426630-427558\tchapter\n", "427559-515689\tclause\n", "515690-606393\tclause_atom\n", "606394-651572\thalf_verse\n", "651573-904775\tphrase\n", "904776-1172307\tphrase_atom\n", "1172308-1236024\tsentence\n", "1236025-1300538\tsentence_atom\n", "1300539-1414388\tsubphrase\n", "1414389-1437601\tverse\n", "1437602-1446831\tlex\n", "```\n", "\n", "This is shorthand for a mapping of the integers 1..1446831 to strings `word`, `book`, ... 
, `lex`.\n", "\n", "By the way, all data files of a TF corpus are in this format, and each file specifies a mapping from\n", "numbers (or pairs of numbers) to values, which can be numbers or strings." ] }, { "cell_type": "markdown", "id": "b7e0727b-6426-40da-ace9-bcfa489fc8f2", "metadata": {}, "source": [ "## Edges\n", "\n", "In a graph there are also edges. How do you search for nodes that are connected by certain edges?\n", "\n", "In both Alpino and Text-Fabric, edges may have properties that can be used in queries.\n", "\n", "An example Alpino query:\n", "\n", "```\n", "match (n:node{cat:'pp'})-[:rel{rel:'hdf'}]->(:nw)\n", "return n\n", "```\n", "\n", "This looks for a PP node that is connected to another node or word by means of an edge whose `rel` property is\n", "the string `hdf`.\n", "\n", "In Text-Fabric we can also make queries like this.\n", "\n", "We have edges between similar verses, and these edges are labeled with the similarity of both verses as a percentage.\n", "\n", "First we look for verses that are exactly 90% similar, and then for verses that are more than 90% similar." ] }, { "cell_type": "code", "execution_count": 13, "id": "4b82fe6b-5438-4051-9a1e-5f5b011204f7", "metadata": {}, "outputs": [], "source": [ "query1 = \"\"\"\n", "verse\n", "-crossref=90> verse\n", "\"\"\"\n", "\n", "query2 = \"\"\"\n", "verse\n", "-crossref>90> verse\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "33acc050-0dc5-4667-ae2b-54164768de62", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.03s 240 results\n", " 0.04s 9574 results\n" ] } ], "source": [ "results1 = A.search(query1)\n", "results2 = A.search(query2)" ] }, { "cell_type": "code", "execution_count": 15, "id": "aa2bd4fc-9532-4e75-ba59-b275853f5365", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
npverseverse
1Genesis 25:31 W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00
2Genesis 25:33 W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 W J>MR J<QB MKRH KJWM >T&BKRTK LJ00
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results1, end=2, condenseType=\"verse\", full=True)" ] }, { "cell_type": "markdown", "id": "1b86c3c4-ef50-4ca7-b087-beb0e12029dc", "metadata": {}, "source": [ "Alpino does a good job in understanding linguistics. It has quite a bit of meaningful\n", "linguistic relationships, and its corpora supply the data for those.\n", "\n", "Text-Fabric is different. It is much more agnostic. It only assumes that there is an ordered set of slots\n", "plus nodes that represent certain subsets of slots; the nodes are divided into types.\n", "\n", "Both the types and the subsets are given with the corpus as TF data. Above we saw the `otype.tf` file,\n", "but there is also an `oslots.tf` file, that maps each non-slot node to the set of slots it is linked to." ] }, { "cell_type": "markdown", "id": "e9d5b6de-6aa3-424e-b82b-e32ae9140691", "metadata": {}, "source": [ "## Query results\n", "\n", "What do queries return?\n", "\n", "In Alpino they return nodes and/or values that certain features have for those nodes.\n", "\n", "In Text-Fabric they return tuples of nodes. In a TF query, most lines specify a node with properties that\n", "the query has to instantiate in the corpus. 
If a query specifies 10 such nodes, then the query results\n", "are 10-tuples of nodes in the corresponding order.\n", "\n", "The nodes returned are naked, not dressed up with features.\n", "\n", "In real life, people issue TF queries either in the Text-Fabric browser, where they can customise how\n", "query results are displayed, or in their own programs, where they catch the result nodes (such programs typically run in a \n", "Jupyter notebook, but by no means necessarily so).\n", "\n", "In the Text-Fabric browser users can influence in various ways which features are displayed, one of them \n", "being: if a feature is mentioned in a query, then it is displayed.\n", "Users can also explicitly request features, or inhibit certain standard features that are displayed by default.\n", "(These defaults are not Text-Fabric things, but set by the corpus modeller).\n", "\n", "In Jupyter notebooks, users can programmatically achieve the same effects by using the functions `table()` and `show()`\n", "and supplying various keyword arguments.\n", "\n", "All in all, given the difference in purpose, technology and scope between Text-Fabric and Alpino,\n", "the underlying concepts map fairly well from the one to the other." ] }, { "cell_type": "markdown", "id": "cfdd4592-24b6-4e02-b836-1d0055d67f00", "metadata": {}, "source": [ "## More complicated queries\n", "\n", "It is not always the case that a query is a neatly nested template. In fact, such queries\n", "only use the \"embedding\" relationship, but there are many more relationships.\n", "\n", "In Alpino, you can give names to the nodes in a query, and the same is true for Text-Fabric.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 16, "id": "56938af3-840a-4dab-87d1-b6d2889a0cb7", "metadata": { "tags": [] }, "outputs": [], "source": [ "query = \"\"\"\n", "\n", "clause\n", "  phrase\n", "    := w1:word sp=verb\n", "  <: phrase\n", "    =: w2:word sp=verb\n", "\n", "w1 .lex. 
w2\n", "\"\"\" " ] }, { "cell_type": "markdown", "id": "7957da3c-7661-41bc-a99a-3a49ee9df7c4", "metadata": {}, "source": [ "This means: find a verb in two phrases of a clause The first verb should be the last word in its phrase,\n", "the second verb should be the first word in its phrase.\n", "\n", "The last line states a relationship between the two words: there lexeme value should be identical." ] }, { "cell_type": "code", "execution_count": 17, "id": "39ef811a-e31c-4d34-a13d-96479533df3e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.66s 475 results\n" ] } ], "source": [ "results = A.search(query)" ] }, { "cell_type": "code", "execution_count": 18, "id": "ae334354-4865-4d96-8c5a-2e28916d2dcc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npclausephrasewordphraseword
1Genesis 2:16 MKL <Y&HGN >KL T>KL00
2Genesis 2:17 KJ BJWM MWT TMWT00
3Genesis 3:4 L>& MWT TMTWN00
4Genesis 3:16 HRBH >RBH <YBWNK WHRNK
5Genesis 8:7 W JY> JYW> WCWB
6Genesis 12:3 W >BRKH MBRKJK
7Genesis 15:13 JD< TD<
8Genesis 16:10 HRBH >RBH >T&ZR<K
9Genesis 17:13 HMWL05 JMWL JLJD BJTK WMQNT KSPK
10Genesis 18:10 CWB >CWB >LJK K<T XJH
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, end=10, skipCols={2, 3, 4, 5})" ] }, { "cell_type": "markdown", "id": "770e289f-afef-4366-b65e-9e8f18dd656d", "metadata": {}, "source": [ "## Quantifiers\n", "\n", "In Alpino you can use quantifiers. These are parts of a query where you look for the\n", "existence or non-existence of certain patterns.\n", "This seems to be a bit of a problematic device, because there are certain conditions on quantifiers.\n", "\n", "In Text-Fabric it is not different: here there are also quantifiers, and here there are also restrictions\n", "on the quantifier expressions.\n", "\n", "In Text-Fabric, quantified parts of the query do not contribute to the result tuple.\n", "\n", "Here is an example.\n", "\n", "Let's see how many `VP`-phrases there are:" ] }, { "cell_type": "code", "execution_count": 19, "id": "e95a4123-f369-4d8d-9938-111aa0e8f106", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.16s 69024 results\n" ] } ], "source": [ "resultsVP = A.search(\"\"\"phrase typ=VP\"\"\")" ] }, { "cell_type": "markdown", "id": "f2aa9212-c927-4657-9285-04c4b50e114f", "metadata": {}, "source": [ "Sometimes VPs contain a noun:" ] }, { "cell_type": "code", "execution_count": 20, "id": "534d6411-3b39-4dae-95f3-89c2c49fa28f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.31s 234 results\n" ] } ], "source": [ "query = \"\"\"\n", "phrase typ=VP\n", " word sp=subs\n", "\"\"\"\n", "resultsWithNoun = A.search(query, shallow=True)" ] }, { "cell_type": "markdown", "id": "772bf015-0a08-4d03-a39a-c1cf7f9402da", "metadata": {}, "source": [ "Note the `shallow=True`: this means that we deliver the results differently: not as a tuple of tuples,\n", "but as a set of nodes that correspond to the first node in the query: the `phrase`." 
] }, { "cell_type": "markdown", "id": "aa91ad10-ce1b-4252-8bf1-b203f58249e2", "metadata": {}, "source": [ "If we want the VPs without nouns:" ] }, { "cell_type": "code", "execution_count": 21, "id": "d1ca5aba-c244-4686-9c0f-2b7d2391a362", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.35s 68790 results\n" ] } ], "source": [ "resultsWithoutNoun = A.search(\"\"\"\n", "phrase typ=VP\n", "/without/\n", " word sp=subs\n", "/-/\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "f406fd9e-13c8-44ef-a3a7-f4e3369af434", "metadata": {}, "source": [ "A check to see if the results add up:" ] }, { "cell_type": "code", "execution_count": 22, "id": "da11201b-43ed-4765-85cc-258fe6090b58", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(resultsVP) == len(resultsWithNoun) + len(resultsWithoutNoun)" ] }, { "cell_type": "markdown", "id": "90db1592-6737-4dc5-aab1-e0ce07770dc4", "metadata": {}, "source": [ "OK, the numbers of results of the different queries are as expected, but we can also\n", "compare the results themselves:" ] }, { "cell_type": "code", "execution_count": 23, "id": "417a5402-9471-43e5-a7e6-46412c4c0c9c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(r[0] for r in resultsVP) == resultsWithNoun | set(r[0] for r in resultsWithoutNoun)" ] }, { "cell_type": "markdown", "id": "4c6c009e-d86c-48cc-b3b8-eff962b7c3b9", "metadata": {}, "source": [ "If we want the VPs with only verbs:" ] }, { "cell_type": "code", "execution_count": 24, "id": "7ad34e9f-1a83-498e-a418-89197658ba5d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.45s 62771 results\n" ] } ], "source": [ "resultsVerb1 = A.search(\"\"\"\n", "phrase typ=VP\n", "/without/\n", " word sp#verb\n", "/-/\n", "\"\"\")" 
] }, { "cell_type": "markdown", "id": "ede630c8-e795-444f-b478-a1fc325c5fcb", "metadata": {}, "source": [ "Or, in a slightly different way, showing a different quantifier:" ] }, { "cell_type": "code", "execution_count": 25, "id": "03a2689a-60f4-4110-8d7c-b29585fff4e0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.70s 62771 results\n" ] } ], "source": [ "resultsVerb2 = A.search(\"\"\"\n", "phrase typ=VP\n", "/where/\n", " w:word\n", "/have/\n", " w sp=verb\n", "/-/\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "103011c1-0978-4c90-bf70-cca40cd23624", "metadata": {}, "source": [ "## Counting and sorting\n", "\n", "Unlike Alpino, Text-Fabric has no SQL-like constructs to count, group and aggregate results,\n", "except the `shallow=True` parameter, which reduces all results that have the same first node\n", "to a single result consisting of that node only.\n", "\n", "Text-Fabric relies on post-processing by the users, either in the program in which they\n", "issued the query, or in other programs into which they import results exported from the\n", "Text-Fabric browser.\n", "\n", "There is also an API function [A.export()](https://annotation.github.io/text-fabric/tf/advanced/display.html#tf.advanced.display.export) for use in your programs\n", "to export results as tab-separated tables of dressed-up nodes.\n", "\n", "Concerning *sorting*: the search function can be passed a sort key to order results.\n", "If `sort=True` is passed, the results are ordered by the text-induced ordering of the result tuples."
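, "\n", "\n", "The post-processing described above can be sketched in plain Python.\n", "This is a minimal illustration under assumptions, not Text-Fabric machinery: the result tuples and the\n", "`phrase_type` dictionary are made-up stand-ins for real search results and for a real feature lookup.\n", "\n", "
```python
from collections import Counter, defaultdict

# Hypothetical result tuples (clause, phrase, word); the integers stand in
# for Text-Fabric nodes and are made up for this illustration.
results = [
    (500, 610, 12),
    (500, 610, 13),
    (500, 611, 14),
    (501, 620, 3),
    (502, 630, 7),
    (502, 631, 8),
]

# Hypothetical feature values per phrase node; in a real program this
# lookup would go through the corpus features instead of a dictionary.
phrase_type = {610: 'VP', 611: 'NP', 620: 'VP', 630: 'NP', 631: 'NP'}

# Count: how many results does each clause (the first node) contribute?
per_clause = Counter(r[0] for r in results)

# Group: collect the distinct phrases found under each clause.
phrases_by_clause = defaultdict(set)
for clause, phrase, _word in results:
    phrases_by_clause[clause].add(phrase)

# Aggregate: tally the distinct phrases by their type.
distinct_phrases = {r[1] for r in results}
per_type = Counter(phrase_type[p] for p in distinct_phrases)

# Reduce to first nodes only, the effect the text ascribes to shallow=True.
first_nodes = {r[0] for r in results}

# Sort: order clauses by descending result count, then by node number.
ranked = sorted(per_clause.items(), key=lambda kv: (-kv[1], kv[0]))

print(ranked)
print(dict(per_type))
```

Since query results are tuples of nodes, the same `Counter` and `sorted` idioms should carry over to the output of `A.search()`."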
] }, { "cell_type": "markdown", "id": "3adecf39-2ddc-49f9-aab8-f3f4c1c1fdf1", "metadata": {}, "source": [ "## Set-theoretic operations on results\n", "\n", "Alpino provides set-theoretic operations on results; in Text-Fabric the user\n", "has to do that by means of post-processing.\n", "\n", "Note that Text-Fabric constrains search templates: each has to be a single connected component, in the sense\n", "that between every pair of nodes in the query template there must be a path of relationships.\n", "\n", "If a search template consisted of multiple connected components, the result would be the Cartesian product of the results of the individual queries.\n", "\n", "Such result sets are potentially monstrous, and it is unlikely that the user can and will deal with them, so they are prohibited." ] }, { "cell_type": "markdown", "id": "bfcbc8df-792f-45c3-8c25-db3e4473ad0d", "metadata": {}, "source": [ "# Limitations\n", "\n", "Alpino search is based on Cypher and SQL. Limitations in Cypher queries can sometimes be\n", "compensated by excursions into SQL.\n", "\n", "Also in Text-Fabric the expressive power of queries is limited. \n", "Moreover, there are also queries that can be expressed but require too much time to execute.\n", "\n", "That is partly due to lack of sophistication in the Text-Fabric engine and partly due to \n", "inherent complexity of spatial relationships between nodes.\n", "\n", "In Text-Fabric we do not have an escape to another query language.\n", "Instead, the escape is to hand-coding. 
\n", "\n", "There is a \n", "[tutorial notebook](https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/searchGaps.ipynb)\n", "in which we explore a difficult query task.\n", "Although we can solve it by a query, we also do it by hand-coding.\n", "We make sure both give the same result and then we save the result as a *named set* to disk.\n", "\n", "We can then invoke Text-Fabric later on with a parameter to include this named set.\n", "From then on, the name of the set can be used in queries in any place where a node type\n", "is expected.\n", "\n", "In the above notebook the section **Custom sets for (non-)gapped phrases** shows\n", "how that works." ] }, { "cell_type": "markdown", "id": "60ba2e71-859e-4fd1-90ee-f10fa1980485", "metadata": {}, "source": [ "# Pre-computation\n", "\n", "In Alpino there are pre-computed pieces of data, e.g. the feature `vor_feld`.\n", "This is advertised as a device that simplifies queries and makes them much more efficient.\n", "\n", "In Text-Fabric pre-computation is also used, at several levels.\n", "\n", "## Internal pre-computation\n", "\n", "Some of the pre-computation belongs to the internals of Text-Fabric, such as\n", "spatial indexes that facilitate the computation of embedding relations between nodes, among\n", "other things.\n", "See for example [`levUp`](https://annotation.github.io/text-fabric/tf/core/prepare.html#tf.core.prepare.levUp).\n", "Such precomputed data is also made available in raw form to the end user \n", "through the [Computed API](https://annotation.github.io/text-fabric/tf/cheatsheet.html#c-computed-data-components).\n", "\n", "## Corpus-level pre-computation\n", "\n", "At the second level there are computed features that the corpus modeller has included in the \n", "corpus. 
For example, in the BHSA there are features for the frequency and rank of words and lexemes.\n", "See [freq/rank](https://etcbc.github.io/bhsa/features/freq_lex/).\n", "\n", "There are also features that have been added by others to the corpus as a separate module.\n", "The [similarity](https://nbviewer.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb)\n", "feature that we encountered before is an example of that.\n", "\n", "## User-generated pre-computation\n", "\n", "The named sets discussed above are an example of how end users themselves can compute data that are\n", "helpful in subsequent queries.\n", "\n", "This is the ethos of Text-Fabric: that end users, corpus modellers, and researchers have maximum\n", "scope and flexibility to compute with the corpus." ] }, { "cell_type": "markdown", "id": "14009e68-9037-4ec9-8864-ca3cebcc2f71", "metadata": {}, "source": [ "# Alpino and Text-Fabric\n", "\n", "Alpino and Text-Fabric are very different in the size of corpora they deal with and the specific\n", "features they assume to be present in the corpora.\n", "\n", "Alpino corpora are linguistic corpora; Text-Fabric corpora do not have to be linguistic.\n", "There are even corpora in Text-Fabric whose texts are not in a language, such as the\n", "proto-cuneiform tablets of [Uruk](https://github.com/Nino-cunei/uruk).\n", "\n", "Can there be synergies between the Alpino world and the Text-Fabric world?\n", "\n", "## Alpino helps Text-Fabric\n", "\n", "There are several corpora in Text-Fabric that could benefit from linguistic tools, especially\n", "tokenisers, pos-taggers and morphological taggers.\n", "Whether Alpino can help depends on the languages that are supported by Alpino, because so far Text-Fabric has dealt with historical corpora using mixtures of historical languages with a variety\n", "of spelling idiosyncrasies.\n", "\n", "## Text-Fabric helps Alpino\n", "\n", "When end-users want to combine close reading with data-analysis, Text-Fabric is a 
handy\n", "tool. Although Text-Fabric cannot deal with huge corpora, it can deal with corpora the size\n", "of 5 million words with dozens of features or 0.5 million words and over a hundred features.\n", "\n", "Text-Fabric also has machinery to deal with volumes of a corpus.\n", "\n", "So one could use Alpino to make a top-level search on a huge corpus, and then export\n", "results volume by volume, where Text-Fabric can be used to deal with individual volumes.\n", "\n", "# Conclusion\n", "\n", "Regardless of whether there is real synergy between Alpino and Text-Fabric, it is encouraging\n", "to see that once text is regarded as a graph, there is a certain logic in how a query language\n", "should work. Both Alpino and Text-Fabric have hit upon the same elements, driven by how\n", "other tools have tackled these things.\n", "\n", "Whereas Alpino rests on Cypher (at least for this part of its query capability), Text-Fabric\n", "has been inspired by [Emdros](https://emdros.org/whatis.html) by Ulrik Sandborg-Petersen." ] }, { "cell_type": "markdown", "id": "87be178b-b55d-4db3-ac4a-733fff95c056", "metadata": {}, "source": [ "# Author\n", "\n", "Dirk Roorda\n", "\n", "CC-BY" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" }, "toc-autonumbering": false, "toc-showmarkdowntxt": false, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 5 }