{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Use TSV data\n", "\n", "We show how to work with the TSV data of two editions of the Fusus: Lakhnawi (from the text extraction pipeline) and Afifi (from the OCR pipeline).\n", "\n", "Fusus has a function, `loadTsv`, that imports TSV data produced by either the OCR pipeline or the text extraction pipeline.\n", "\n", "The two pipelines produce slightly different columns.\n", "When unpacking the TSV data, the function casts the appropriate columns to integers.\n", "\n", "Reference: [convert](https://among.github.io/fusus/fusus/convert.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fusus.convert import loadTsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a known work, such as the Lakhnawi edition of the Fusus,\n", "we can pass its acronym as the `source`; see [works](https://among.github.io/fusus/fusus/works.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lakhnawi\n", "\n", "## By acronym" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv\n" ] } ], "source": [ "(headers, words) = loadTsv(source=\"fususl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`loadTsv` returns the header fields and the words:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')\n" ] } ], "source": [ "print(headers)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "51814" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(words)" ] }, { "cell_type": "code", "execution_count": 6, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')\n" ] } ], "source": [ "print(words[40000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By path\n", "\n", "Alternatively, we could have gotten it as follows:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv\n" ] } ], "source": [ "(headers, words) = loadTsv(source=\"~/github/among/fusus/ur/Lakhnawi/allpages.tsv\", ocred=False)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')\n", "51814\n", "(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')\n" ] } ], "source": [ "print(headers)\n", "print(len(words))\n", "print(words[40000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Afifi\n", "\n", "## By acronym" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv\n" ] } ], "source": [ "(headers, words) = loadTsv(source=\"fususa\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get the header fields and the words:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')\n" ] } ], "source": [ "print(headers)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "46264" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(words)" ] }, { "cell_type": "code", 
"execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')\n" ] } ], "source": [ "print(words[40000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By path\n", "\n", "Alternatively, we can pass the path to the TSV file, with `ocred=True` because this data comes from the OCR pipeline:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv\n" ] } ], "source": [ "(headers, words) = loadTsv(source=\"~/github/among/fusus/ur/Affifi/allpages.tsv\", ocred=True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')\n", "46264\n", "(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')\n" ] } ], "source": [ "print(headers)\n", "print(len(words))\n", "print(words[40000])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }