{ "cells": [ { "cell_type": "markdown", "id": "b33ecd94-2835-4391-9f08-670b379d13de", "metadata": {}, "source": [ "# Advanced search options (Nestle1904GBI)" ] }, { "cell_type": "markdown", "id": "d53d318b-65fe-4980-8a48-daddd355f115", "metadata": {}, "source": [ "## Table of contents \n", "* 1 - Introduction\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", "    * 3.1 - TBD\n", "    * 3.2 - Inspecting your query\n", "    * 3.3 - Comparing two lists with query results\n", "* 4 - Footnotes and attribution\n", "* 5 - Required libraries" ] }, { "cell_type": "markdown", "id": "8ab013e2-c82e-4cd8-a54f-d696b1bebd41", "metadata": {}, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "TBD" ] }, { "cell_type": "markdown", "id": "dbc8843b-4930-4ab4-b92b-dbbfb76ca88f", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 8, "id": "9cc1b0db-edf8-4795-af72-2f2be414d5d8", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 9, "id": "6abd3aa9-e545-48f1-8392-33b63e28db6d", "metadata": {}, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 13, "id": "61625284-ca5a-4832-b188-c881d8efceca", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4" ], "text/plain": [ "" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.10, tonyjurg/Nestle1904GBI/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904GBI 0.4, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots/node% coverage
book275102.93100
chapter260529.92100
sentence572024.09100
verse794317.35100
clause161248.54100
phrase726741.90100
word1377791.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (GBI nodes)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "booknum\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "clause\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "clauserule\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "formaltag\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "functionaltag\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "phrase\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunction\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunctionlong\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "phrasetype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "splong\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the N1904 app and data\n", "N1904 = use (\"tonyjurg/Nestle1904GBI\", version=\"0.4\", hoist=globals())" ] }, { "cell_type": "code", "execution_count": 12, "id": "6082627a-66cc-4620-88b6-84998f4640ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)\n", "N1904.dh(N1904.getCss())" ] }, { "cell_type": "markdown", "id": "6cee9712-ae79-477c-9f39-1a51e3a0a8a1", "metadata": {}, "source": [ "# 3 - Performing the queries \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "e2b752ab-1b2a-46f3-a079-fe0a45555b8d", "metadata": {}, "source": [ "## 3.1 - TBD\n", "##### [Back to TOC](#TOC)\n", "\n", "TBD" ] }, { "cell_type": "markdown", "id": "47a87214-feb2-4b04-9057-91f302ba600c", "metadata": {}, "source": [ "## 3.2 - Inspecting your query\n", "##### [Back to TOC](#TOC)\n", "\n", "Each query templace can be inspected by use of [`S.study()`](https://annotation.github.io/text-fabric/tf/search/search.html#tf.search.search.Search.study). This is particulary helpfull in case the query is complicated." 
] }, { "cell_type": "code", "execution_count": 15, "id": "2352983f-398d-4cd6-afbc-a7ef83d25602", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 2 objects ...\n", " | 0.00s \"Quantifier on \"parent:clause\"\n", " | | /where/\n", " | | parent:clause\n", " | | phrase phrasefunction=S\n", " | | 0.08s 11391 matching nodes\n", " | | /have/\n", " | | parent:clause\n", " | | phrase phrasefunction=S\n", " | | /without/\n", " | | word sp#conj\n", " | | /-/\n", " | | /-\n", " | | | /without/\n", " | | | parent:phrase phrasefunction=S\n", " | | | word sp#conj\n", " | | | /-\n", " | | | 0.21s 10976 nodes to exclude\n", " | | 0.23s reduction from 11391 to 415 nodes\n", " | | 0.23s 415 matching nodes\n", " | | 0.24s 8783 match antecedent but not consequent\n", " | 0.41s reduction from 16124 to 7341 nodes\n", " 0.26s Constraining search space with 1 relations ...\n", " 0.27s \t1 edges thinned\n", " 0.27s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.27s Ready to deliver results from 8849 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "ComplicatedQuery = '''\n", "clause\n", "/where/\n", " phrase phrasefunction=S\n", "/have/\n", " /without/\n", " word sp#conj\n", " /-/\n", "/-/\n", " phrase phrasefunction=O\n", "'''\n", "S.study(ComplicatedQuery)" ] }, { "cell_type": "code", "execution_count": 16, "id": "2fbd91b5-5092-4018-a9af-fa63a2a934be", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4.00s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 clause\n", " 2 /where\n", " 3 phrase phrasefunction=S\n", " 4 /have\n", " 5 /without\n", " 6 word sp#conj\n", " 7 /-\n", " 8 /-\n", " 9 R1 phrase phrasefunction=O\n", "10 \n" ] } ], "source": [ "S.showPlan()" ] }, { "cell_type": "code", "execution_count": 
17, "id": "83ee75bd-f2c3-4c91-b0fe-a3663c812873", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6.12s clause[πολλαπλασίονα λήμψεται καὶ ζωὴν αἰώνιον ...] phrase[πολλαπλασίονα ]\n", " 6.12s clause[πολλαπλασίονα λήμψεται καὶ ζωὴν αἰώνιον ...] phrase[ζωὴν αἰώνιον ]\n", " 6.12s clause[συμφωνήσας δὲ μετὰ τῶν ἐργατῶν ...] phrase[αὐτοὺς ]\n", " 6.13s clause[καὶ ἐξελθὼν περὶ τρίτην ὥραν ...] phrase[ἄλλους ]\n", " 6.13s clause[Τί οὖν ἐροῦμεν; ] phrase[Τί οὖν ]\n", " 6.13s clause[διὰ τοῦ ἀγαθοῦ μοι κατεργαζομένη ...] phrase[θάνατον, ]\n", " 6.13s clause[καὶ ἴσους αὐτοὺς ἡμῖν ἐποίησας ...] phrase[αὐτοὺς ]\n", " 6.13s clause[καὶ ἴσους αὐτοὺς ἡμῖν ἐποίησας ...] phrase[τὸ βάρος τῆς ἡμέρας καὶ ...]\n", " 6.13s clause[οὐκ ἀδικῶ σε· οὐχὶ δηναρίου ...] phrase[σε· οὐχὶ ]\n", " 6.13s clause[ἆρον τὸ σὸν καὶ ὕπαγε· ...] phrase[τὸ σὸν καὶ ]\n" ] } ], "source": [ "for result in S.fetch(limit=10):\n", "    TF.info(S.glean(result))" ] }, { "cell_type": "markdown", "id": "e3d3c434-6b7a-435e-9525-d2096866e604", "metadata": {}, "source": [ "## 3.3 - Comparing two lists with query results\n", "##### [Back to TOC](#TOC)\n", "\n", "Using standard Python operators, it appears easy to verify whether the results of two queries are the same. However, it is worth taking a closer look, since a few pitfalls can lead to misleading conclusions." ] }, { "cell_type": "code", "execution_count": 39, "id": "a57e9482-6843-4f69-a965-3803eac00fe3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.16s 2216 results\n", " 0.15s 2216 results\n", "Same result? True\n" ] } ], "source": [ "# define the query template\n", "SomeQuery ='''\n", "phrase phrasefunction=V\n", " word lemma=λέγω\n", "'''\n", "# run the same query twice and compare the two result lists\n", "SomeResult1=N1904.search(SomeQuery)\n", "SomeResult2=N1904.search(SomeQuery)\n", "print(f'Same result? 
{SomeResult1 == SomeResult2}')" ] }, { "cell_type": "markdown", "id": "ea751b45-2398-4ea4-b3cf-81808cae3aa4", "metadata": {}, "source": [ "This is exactly what we would expect. But comparing lists can be tricky. Consider the following two queries." ] }, { "cell_type": "code", "execution_count": 25, "id": "5373b900-a8fe-4079-8aeb-bf7f2ee5e64c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.34s 3851 results\n", " 0.35s 3851 results\n", "Same result? False\n" ] } ], "source": [ "# define query template 1\n", "Query1 ='''\n", "phrase\n", " a:word sp=prep\n", " b:word sp=adj\n", " c:word sp=noun \n", "'''\n", "\n", "# define query template 2\n", "Query2 ='''\n", "phrase\n", " a:word sp=prep\n", " b:word sp=noun\n", " c:word sp=adj\n", "'''\n", "\n", "# create and compare result lists\n", "ResultQuery1=N1904.search(Query1)\n", "ResultQuery2=N1904.search(Query2)\n", "\n", "print(f'Same result? {ResultQuery1 == ResultQuery2}')" ] }, { "cell_type": "markdown", "id": "702723ab-2a52-41da-934e-7c4b36118e2b", "metadata": {}, "source": [ "This comparison reports a difference between the two lists. Whether they really differ depends on what counts as a difference. Both ResultQuery1 and ResultQuery2 are lists of **ordered tuples** of the form (phrase, a, b, c). Swapping the feature conditions 'sp=adj' and 'sp=noun' does not change which words are matched; it only changes the position of the adjective and noun nodes within each tuple. The order of the tuples within the list matters as well, as the next cell shows by running the same query with and without sorting. " ] }, { "cell_type": "code", "execution_count": 33, "id": "a38422bf-24fb-435c-a156-a6bee707986b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.34s 3851 results\n", " 0.33s 3851 results\n", "Unsorted lists: Same result ? False\n", "Sorted lists: Same result ? 
False\n" ] } ], "source": [ "# create 2 result lists\n", "ResultQuery3=N1904.search(Query1,sort=True)\n", "ResultQuery4=N1904.search(Query1,sort=False)\n", "\n", "# compare unsorted lists\n", "print(f'Unsorted lists: Same result ? {ResultQuery3 == ResultQuery4}')\n", "\n", "# sort both lists on the first element of each tuple (the phrase node)\n", "SortedResultQuery3 = sorted(ResultQuery3, key=lambda x: x[0])\n", "SortedResultQuery4 = sorted(ResultQuery4, key=lambda x: x[0])\n", "\n", "# compare sorted lists\n", "print(f'Sorted lists: Same result ? {SortedResultQuery3 == SortedResultQuery4}')" ] }, { "cell_type": "markdown", "id": "979830b3-0714-4824-80eb-8a484062e73c", "metadata": {}, "source": [ "Unexpectedly, the sorted lists still compare as different. But why? The == operator on Python lists does compare contents: two lists are equal only if they have the same length and equal elements at every position. The order of the elements therefore matters.\n", "\n", "The search() function in Text-Fabric returns a list of tuples of nodes representing the search results. ResultQuery3 and ResultQuery4 hold the same tuples, but in a different order. Sorting with `key=lambda x: x[0]` orders the tuples only by their first element (the phrase node). Because Python's sort is stable, tuples that share the same phrase node keep their original relative order, which most likely still differs between the two lists. Hence the comparison with the == operator results in False.\n", "\n", "To compare the content of the lists regardless of order, it is advised to first convert them to sets and compare those sets. 
See the following example: " ] }, { "cell_type": "code", "execution_count": 35, "id": "3aa8e50e-d7ed-4724-9d23-b2b81dbebda2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lists ResultQuery3 and ResultQuery4 are equal.\n" ] } ], "source": [ "# Convert each list of result tuples to a set of tuples\n", "set1 = set(tuple(item) for item in ResultQuery3)\n", "set2 = set(tuple(item) for item in ResultQuery4)\n", "\n", "# Compare the sets (the order of the tuples no longer matters)\n", "if set1 == set2:\n", "    print(\"Lists ResultQuery3 and ResultQuery4 are equal.\")\n", "else:\n", "    print(\"Lists ResultQuery3 and ResultQuery4 are not equal.\")" ] }, { "cell_type": "markdown", "id": "d6c70d84-b455-4213-9827-049a03dd21fd", "metadata": {}, "source": [ "Now, let's compare ResultQuery1 and ResultQuery2 again by first converting them to sets. " ] }, { "cell_type": "code", "execution_count": 38, "id": "3302530c-0dd3-4062-882c-087d00c72b71", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lists ResultQuery1 and ResultQuery2 are not equal.\n" ] } ], "source": [ "# Convert each list of result tuples to a set of tuples\n", "set1 = set(tuple(item) for item in ResultQuery1)\n", "set2 = set(tuple(item) for item in ResultQuery2)\n", "\n", "# Compare the sets (the order of the tuples no longer matters)\n", "if set1 == set2:\n", "    print(\"Lists ResultQuery1 and ResultQuery2 are equal.\")\n", "else:\n", "    print(\"Lists ResultQuery1 and ResultQuery2 are not equal.\")" ] }, { "cell_type": "markdown", "id": "eb6b6ec3-6079-4bb9-8edb-14cad4bae771", "metadata": {}, "source": [ "This is indeed the expected result: the tuples in ResultQuery1 and ResultQuery2 hold the adjective and noun nodes in swapped positions, so the tuples differ even though both queries match the same words." ] }, { "cell_type": "markdown", "id": "6f18dede-c716-4ba1-a396-554e3390c605", "metadata": { "tags": [] }, "source": [ "# 4 - Footnotes and attribution \n", "##### [Back to TOC](#TOC)\n", "\n", "N.A." 
] }, { "cell_type": "markdown", "id": "30bcd466-c196-48df-8f6e-5911700886e7", "metadata": { "tags": [] }, "source": [ "# 5 - Required libraries \n", "##### [Back to TOC](#TOC)\n", "\n", "The scripts in this notebook require (besides `text-fabric`) the following Python libraries to be installed in the environment:\n", "\n", " {none}\n", "\n", "You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }