# Use TSV data

We show how to work with the TSV data produced from the Lakhnawi PDF and from the OCRed Afifi edition.

Fusus has a function that imports the TSV data produced by the OCR pipeline and by the text extraction pipeline.

The two pipelines produce slightly different columns.
When unpacking the TSV data, the function casts the appropriate columns to integers.
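To make the casting step concrete, here is a minimal sketch of reading TSV text and converting selected columns to integers. It is not the implementation of `loadTsv`: the helper name `read_tsv_with_ints`, the `INT_COLUMNS` set, and the sample line (with the placeholder word `abc`) are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical choice of integer columns; the real function decides this per pipeline.
INT_COLUMNS = {"page", "line", "column", "span", "left", "top", "right", "bottom"}


def read_tsv_with_ints(handle, int_columns=INT_COLUMNS):
    """Read TSV text and cast the named columns to int (illustrative only)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = tuple(next(reader))
    rows = []
    for fields in reader:
        # Cast a value only if its column is listed and the cell is not empty.
        row = tuple(
            int(value) if name in int_columns and value != "" else value
            for name, value in zip(headers, fields)
        )
        rows.append(row)
    return headers, rows


# Made-up sample in the shape of the text extraction output.
sample = (
    "page\tline\tcolumn\tspan\tdirection\tleft\ttop\tright\tbottom\tword\n"
    "355\t12\t1\t1\tr\t390\t373\t390\t394\tabc\n"
)
sample_headers, sample_rows = read_tsv_with_ints(io.StringIO(sample))
print(sample_rows[0])
```

Empty cells are left as strings rather than cast, so a missing value does not raise an error in this sketch.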

Reference: [convert](https://among.github.io/fusus/fusus/convert.html).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from fusus.convert import loadTsv

For a known work, such as the Lakhnawi edition of the Fusus,
we can pass a keyword instead of a path; see [works](https://among.github.io/fusus/fusus/works.html).

# Lakhnawi

## By acronym

In [3]:
(headers, words) = loadTsv(source="fususl")

Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv


We get the header fields and the words:

In [4]:
print(headers)

('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')


In [5]:
len(words)

51814

In [6]:
print(words[40000])

(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')
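Since each word is a plain tuple, it can help to pair it with the header fields before using it. The following sketch copies the header and the word shown above so that it runs without fusus; the names `record`, `width` and `height` are illustrative and not part of the library.

```python
# Values copied from the outputs above, so this runs without loading the file.
headers = ('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')
row = (355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')

# Pair each header field with the value at the same position.
record = dict(zip(headers, row))

# The coordinate columns are integers, so arithmetic works directly.
width = record["right"] - record["left"]
height = record["bottom"] - record["top"]

print(record["page"], record["line"], width, height)
```

Looking fields up by name keeps the code readable and avoids hard-coding positions, which differ between the two pipelines.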


## By path

Alternatively, we can load the same data by passing the path to the file and stating with `ocred=False` that it does not come from the OCR pipeline:

In [11]:
(headers, words) = loadTsv(source="~/github/among/fusus/ur/Lakhnawi/allpages.tsv", ocred=False)

Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv


In [12]:
print(headers)
print(len(words))
print(words[40000])

('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')
51814
(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')


# Afifi

## By acronym

In [13]:
(headers, words) = loadTsv(source="fususa")

Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv


We get the header fields and the words:

In [14]:
print(headers)

('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')


In [15]:
len(words)

46264

In [16]:
print(words[40000])

(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')
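The OCR output carries a confidence value per word, which suggests filtering on it. The sketch below is an assumption-laden illustration: the function `confident_words` and the threshold of 90 are hypothetical, and the confidence is assumed to be a score where higher is better. Because the row printed above holds one more value than the header lists, the sketch addresses `confidence` and `text` from the end of the tuple, where they are the last two fields.

```python
# Row copied from the output above, so this runs without loading the file.
row = (203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')


def confident_words(rows, minimum):
    """Return the text of every row whose confidence is at least `minimum`.

    Assumes 'confidence' and 'text' are the last two fields of each row,
    as they are in the header printed above.
    """
    return [r[-1] for r in rows if r[-2] >= minimum]


# Hypothetical threshold, chosen only for illustration.
kept = confident_words([row], minimum=90)
print(len(kept))
```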


## By path

Alternatively, we can load the same data by passing the path to the file and stating with `ocred=True` that it comes from the OCR pipeline:

In [17]:
(headers, words) = loadTsv(source="~/github/among/fusus/ur/Afifi/allpages.tsv", ocred=True)

Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv


In [18]:
print(headers)
print(len(words))
print(words[40000])

('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')
46264
(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')
