# Bounding boxes

Every word in the corpus has bounding box information stored in the features
`boxl`, `boxt`, `boxr`, `boxb`, which store the coordinates of the left, top, right, bottom boundaries.

For top and bottom, these are the $y$-coordinates; for left and right, they are the $x$-coordinates.

The origin is the top left of the page. 

The $x$ coordinates increase when going to the right, the $y$ coordinates increase when going down.
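
As a small illustration of this coordinate system, the sketch below derives the width and height of a box from its four boundaries and checks which of two boxes sits higher on the page. The `Box` class, the helper `is_higher_on_page`, and the sample numbers are hypothetical; only the four feature names come from the corpus.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical container for the values of boxl, boxt, boxr, boxb."""

    boxl: float  # x-coordinate of the left edge
    boxt: float  # y-coordinate of the top edge
    boxr: float  # x-coordinate of the right edge
    boxb: float  # y-coordinate of the bottom edge

    @property
    def width(self) -> float:
        # x grows to the right, so right minus left is the width
        return self.boxr - self.boxl

    @property
    def height(self) -> float:
        # y grows downward, so bottom minus top is the height
        return self.boxb - self.boxt


def is_higher_on_page(a: Box, b: Box) -> bool:
    """True if box a starts above box b (a smaller y is closer to the top)."""
    return a.boxt < b.boxt


# Made-up coordinates, for illustration only
first = Box(boxl=100, boxt=50, boxr=180, boxb=70)
second = Box(boxl=100, boxt=90, boxr=160, boxb=110)
print(first.width, first.height)
print(is_higher_on_page(first, second))
```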

We show what you can do with this information.

In [1]:
from tf.app import use

We load version `0.4`.

In [2]:
A = use("among/fusus/tf/Lakhnawi:clone", version="0.4", writing="ara", hoist=globals())

# Multiple words in one box

In version 0.4 the following was the case:

When words are separated not by a space but by punctuation marks, they end up in one box.

So, some words have exactly the same bounding box.

Let's find them.

It turns out that Text-Fabric search has a primitive that comes in handy: we can compare features
of different nodes.

Within each line, we look for two words, one before the other, with the same left and right edges.

In [3]:
templateMultiple = """
line
  w1:word
  < w2:word
  
w1 .boxr. w2
w1 .boxl. w2
"""
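
To make the condition in the template concrete, here is a plain-Python sketch of the same test on made-up data. It does not use Text-Fabric; the tuples and the function `same_edges_pairs` are hypothetical, and the sketch assumes that the words of a line are given in text order.

```python
from itertools import combinations

# Made-up words of a single line, in text order: (text, boxl, boxr)
line_words = [
    ("a", 300, 340),
    ("b-", 200, 290),
    ("c", 200, 290),
    ("d", 120, 190),
]


def same_edges_pairs(words):
    """Return pairs (w1, w2), with w1 before w2, that share boxl and boxr."""
    return [
        (w1[0], w2[0])
        for w1, w2 in combinations(words, 2)  # keeps text order: w1 before w2
        if w1[1] == w2[1] and w1[2] == w2[2]
    ]


print(same_edges_pairs(line_words))
```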

In [4]:
results = A.search(templateMultiple)

  0.70s 578 results


In [5]:
A.table(results, start=1, end=10)

n,p,line,word,word.1
1,1 1:9,,بيروت٤٣٤١هـ–,٣١٠٢م
2,1 4:2,,١‐,نماذج
3,1 4:2,,٣٣٩١……………………,أ
4,1 4:3,,٢‐,عنوانكتاب
5,1 4:3,,الكلم………………………,٦
6,1 4:4,,٣‐,خطبة
7,1 4:4,,الكلم………………………,٨
8,1 4:5,,٤‐[,١]
9,1 4:5,,٤‐[,فصّ
10,1 4:5,,١],فصّ


What if we also stipulate that the two words are adjacent, in the sense that they occupy subsequent slots?

If more than two words occupy the same bounding box, we should get fewer results.
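
The reason is a matter of counting. Assuming the words that share a box are consecutive, a box with $k$ words yields $k(k-1)/2$ pairs when one word merely has to come before the other, but only $k-1$ pairs when the two must be neighbours. A minimal sketch, with hypothetical function names:

```python
def pairs_before(k: int) -> int:
    """Number of pairs (w1, w2) with w1 anywhere before w2 among k words."""
    return k * (k - 1) // 2


def pairs_adjacent(k: int) -> int:
    """Number of pairs (w1, w2) with w2 immediately after w1 among k words."""
    return max(k - 1, 0)


for k in (2, 3, 4):
    print(k, pairs_before(k), pairs_adjacent(k))
```

For three words in one box, as in line `1 4:5` of the table above, the first query lists three pairs, and the adjacency variant would list two.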

In [6]:
templateAdjacent = """
line
  w1:word
  <: w2:word
  
w1 .boxr. w2
w1 .boxl. w2
"""

In [7]:
results = A.search(templateAdjacent)

  0.25s 557 results


In [8]:
A.table(results, start=1, end=10)

n,p,line,word,word.1
1,1 1:9,,بيروت٤٣٤١هـ–,٣١٠٢م
2,1 4:2,,١‐,نماذج
3,1 4:2,,٣٣٩١……………………,أ
4,1 4:3,,٢‐,عنوانكتاب
5,1 4:3,,الكلم………………………,٦
6,1 4:4,,٣‐,خطبة
7,1 4:4,,الكلم………………………,٨
8,1 4:5,,٤‐[,١]
9,1 4:5,,١],فصّ
10,1 4:5,,آدميّة.………………………………,٤١


However, since version 0.5 we split words at an earlier stage, which keeps a good connection between the words and
their bounding boxes.

Let's load that version of the TF data and repeat the queries.

In [9]:
A = use("among/fusus/tf/Lakhnawi:clone", version="0.5", writing="ara", hoist=globals())

In [10]:
results = A.search(templateMultiple)

  0.80s 0 results


That's better!