{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lines\n", "\n", "We show how the pipeline detects lines on the page and we provide\n", "critical examples to see how successful the method is.\n", "\n", "Reference: [lines](https://among.github.io/fusus/fusus/lines.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fusus.book import Book" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "B = Book(cd=\"~/github/among/fusus/example\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# cd to the book directory\n", "!cd `pwd`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We show the line division in every block of text in every page.\n", "Check visually whether all lines have been detected correctly.\n", "\n", "The `histogram` stage shows the blocks that have been detected on the page,\n", "and within the blocks the histograms that correspond to the ink distribution.\n", "\n", "We mark the start and end of lines by orange and purple dots, which are\n", "obtained by a rolling median filter over the first and last black pixel position on each pixel line.\n", "\n", "In each block the main line bands are shown.\n", "A green rule marks the start of a band, a red rule the end.\n", "The space between bands is greyed out.\n", "We show the `main` bands, which are derived directly from the histogram.\n", "\n", "The main bands may not contain *all* the ink, but do not worry: the bands are used to target\n", "the cleaning of marks, and are not visible to the rest of the processing stages.\n", "\n", "Check in particular:\n", "* whether short lines have been detected\n", "* whether consecutive lines are not treated as one line.\n", "\n", "Page 101 is a critical page: both 
errors are likely to occur!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def checkLines(pg, quiet=True, **kwargs):\n", "    if pg is None:\n", "        for pg in B.allPagesList:\n", "            page = B.process(\n", "                batch=False,\n", "                pages=pg,\n", "                doOcr=False,\n", "                uptoLayout=True,\n", "                quiet=quiet,\n", "                **kwargs,\n", "            )\n", "            page.show(stage=\"histogram\")\n", "    else:\n", "        page = B.process(\n", "            batch=False,\n", "            pages=pg,\n", "            doOcr=False,\n", "            uptoLayout=True,\n", "            quiet=quiet,\n", "            **kwargs,\n", "        )\n", "        page.show(stage=\"histogram\")\n", "    return page" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# B.configure(blurY=None, peakSignificant=0.1, peakProminenceY=None, valleyProminenceY=None, debug=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check a single page" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Batch of 1 pages: 101\n", " 0.00s Start batch processing images\n", " | | | -0.00s 1 101.jpg \n", " 1.27s all done\n" ] }, { "data": { "text/html": [ "