{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(label-propagation-page)=\n", "# Propagation of NER labels\n", "\n", "In the context of Named Entity Recognition (NER), typical datasets contain the text tokens and the NER labels for each of the tokens. For example:\n", "\n", "````{margin}\n", "```{note}\n", "`B-P` is short for \"Beginning-Place\"\n", "and `I-P` is short for \"Inside-Place\"\n", "whereas `O` means \"Other\".\n", "See [IOB Tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for more details\n", "```\n", "````\n", " NER Labels: B-P I-P O O\n", " Text: New York is big\n" ] }, { "source": [ "Now, imagine we have obtained a noisy version of the grouth truth text through the OCR process, for example. The problem becomes: how can we label the noisy tokens?\n", "\n", "\n", " NER Labels: B-P I-P O O\n", " GT Text: New York is big\n", " Noisy Text: New Yo rkis big\n", " NER Labels: ? ? ? ?\n", "\n", "We can utilize text alignment and **propagate** the NER labels onto the noisy tokens. We will demonstrate how in the rest of this document.\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Tokenization\n", "\n", "To ensure consistent interpretation of the text alignment results, we need to first tokenize the grouth truth and the OCR'ed (nosiy) text." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from genalog.text import ner_label\n", "from genalog.text import preprocess\n", "\n", "gt_txt = \"New York is big\"\n", "ocr_txt = \"New Yo rkis big\"\n", "\n", "# Input to the method\n", "gt_labels = [\"B-P\", \"I-P\", \"O\", \"O\"]\n", "gt_tokens = preprocess.tokenize(gt_txt) # tokenize into list of tokens\n", "ocr_tokens = preprocess.tokenize(ocr_txt)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['B-P', 'I-P', 'O', 'O']\n", "['New', 'York', 'is', 'big']\n", "['New', 'Yo', 'rkis', 'big']\n" ] } ], "source": [ "# Inputs to the method\n", "print(gt_labels)\n", "print(gt_tokens)\n", "print(ocr_tokens)" ] }, { "source": [ "## Label Propagation\n", "\n", "We then can run label propagation to obtain the NER labels for the OCR'ed (noisy) tokens." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Method returns a tuple of 4 elements (gt_tokens, gt_labels, ocr_tokens, ocr_labels, gap_char)\n", "ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OCR labels: ['B-P', 'I-P', 'I-P', 'O']\n", "Aligned ground truth: New Yo@rk is big\n", "Alinged OCR text: New Yo rk@is big\n" ] } ], "source": [ "# Outputs\n", "print(f\"OCR labels: {ocr_labels}\")\n", "print(f\"Aligned ground truth: {aligned_gt}\")\n", "print(f\"Alinged OCR text: {aligned_ocr}\")" ] }, { "source": [ "## Display Result After Propagation" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "B-P I-P O O \n", "New York is big \n", "New Yo@rk is big\n", "||||||.||.||||||\n", "New Yo rk@is big\n", "New Yo rkis big \n", "B-P I-P I-P O \n", "\n" ] } ], "source": [ "print(ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr))" ] }, { "source": [ "## Final Results\n", "\n", "Formatting the OCR tokens and their NER labels." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "B-P I-P I-P O \n", "New Yo rkis big \n", "\n" ] } ], "source": [ "# Format tokens and labels\n", "print(ner_label.format_labels(ocr_tokens, ocr_labels))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }