{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bounding boxes\n", "\n", "Every word in the corpus has bounding box information stored in the features\n", "`boxl`, `boxt`, `boxr`, `boxb`, which store the coordinates of the left, top, right, and bottom boundaries.\n", "\n", "For top and bottom, these are the $y$-coordinates; for left and right, they are the $x$-coordinates.\n", "\n", "The origin is the top left of the page.\n", "\n", "The $x$-coordinates increase when going to the right, the $y$-coordinates increase when going down.\n", "\n", "We show what you can do with this information." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load version `0.4`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "data: ~/github/among/fusus/tf/Lakhnawi/0.4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 8.4.12, no app configured
Data: among/fusus/tf/Lakhnawi/0.4
Features:
among/fusus/tf/Lakhnawi
boxb
boxl
boxr
boxt
dir
ln
n
nice
np
otype
plain
post
posta
pre
prea
text
title
trans
oslots
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"among/fusus/tf/Lakhnawi:clone\", version=\"0.4\", writing=\"ara\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple words in one box\n", "\n", "In version 0.4 the following was the case:\n", "\n", "When words are not separated by space, but by punctuation marks, they end up in one box.\n", "\n", "So, some words have exactly the same bounding box.\n", "\n", "Let's find them.\n", "\n", "It turns out that Text-Fabric search has a primitive that comes in handy: we can compare features\n", "of different nodes.\n", "\n", "We search in each line, look for two adjacent words with the same left and right edges." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "templateMultiple = \"\"\"\n", "line\n", " w1:word\n", " < w2:word\n", " \n", "w1 .boxr. w2\n", "w1 .boxl. w2\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.70s 578 results\n" ] } ], "source": [ "results = A.search(templateMultiple)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
n p line word word
11 1:9بيروت٤٣٤١هـ–٣١٠٢م
21 4:2١‐نماذج
31 4:2٣٣٩١……………………أ
41 4:3٢‐عنوانكتاب
51 4:3الكلم………………………٦
61 4:4٣‐خطبة
71 4:4الكلم………………………٨
81 4:5٤‐[١]
91 4:5٤‐[فصّ
101 4:5١]فصّ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, start=1, end=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we also stipulate that the two words are adjacent, in the sense that they occupy subsequent slots?\n", "\n", "If more than two words occupy the same bounding box, we should get less results." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "templateAdjacent = \"\"\"\n", "line\n", " w1:word\n", " <: w2:word\n", " \n", "w1 .boxr. w2\n", "w1 .boxl. w2\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.25s 557 results\n" ] } ], "source": [ "results = A.search(templateAdjacent)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
n p line word word
11 1:9بيروت٤٣٤١هـ–٣١٠٢م
21 4:2١‐نماذج
31 4:2٣٣٩١……………………أ
41 4:3٢‐عنوانكتاب
51 4:3الكلم………………………٦
61 4:4٣‐خطبة
71 4:4الكلم………………………٨
81 4:5٤‐[١]
91 4:5١]فصّ
101 4:5آدميّة.………………………………٤١
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, start=1, end=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, from version 0.5 we have split words in an earlier stage, keeping a good connection between the words and\n", "their bounding boxes.\n", "\n", "Let's load that version of the TF data and repeat the queries." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "data: ~/github/among/fusus/tf/Lakhnawi/0.5" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 8.4.12, no app configured
Data: among/fusus/tf/Lakhnawi/0.5
Features:
among/fusus/tf/Lakhnawi
boxb
boxl
boxr
boxt
dir
letters
lettersn
lettersp
letterst
ln
n
np
otype
punc
punca
title
oslots
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"among/fusus/tf/Lakhnawi:clone\", version=\"0.5\", writing=\"ara\", hoist=globals())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.80s 0 results\n" ] } ], "source": [ "results = A.search(templateMultiple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's better!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }