{ "cells": [ { "cell_type": "markdown", "id": "7f21a829", "metadata": {}, "source": [ "\n", "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " Try in Google Colab\n", " \n", " \n", " \n", " \n", " Share via nbviewer\n", " \n", " \n", " \n", " \n", " View on GitHub\n", " \n", " \n", " \n", " \n", " Download notebook\n", " \n", "
\n" ] }, { "cell_type": "markdown", "id": "f9b00a8a-c350-4658-9c00-2f218e2a2892", "metadata": {}, "source": [ "# Horizontal Text Detection with OpenVino Model using FiftyOne tool" ] }, { "cell_type": "markdown", "id": "1fc5eb47-d86e-4211-ac46-c127be994a81", "metadata": {}, "source": [ "This notebook is a demonstration of the usage of Intel's OpenVino [horizontal text detections model](https://docs.openvino.ai/latest/omz_models_model_horizontal_text_detection_0001.html) for the horizontal text detection on [Total Text Dataset](https://www.kaggle.com/datasets/ipythonx/totaltextstr) with the help of the open source tool [FiftyOne](https://docs.voxel51.com/index.html). \n", "The notebook goes through the steps of loading the dataset of images with groundtruth detections into FiftyOne, visualizing the images and adding predictions from the model and evaluating those predictions against the ground truth" ] }, { "cell_type": "markdown", "id": "f1d93744-6ebf-4869-8ba3-7799985b8652", "metadata": {}, "source": [ "### Prequisites\n", " - Get the [Total Text Dataset](https://www.kaggle.com/datasets/ipythonx/totaltextstr) downloaded in the same path where you have \n", " where you have the notebook\n", " - Run the below code cell to get the required python libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "25a55346-f3f7-4307-b268-dcc1737d0a47", "metadata": {}, "outputs": [], "source": [ "!pip install fiftyone cv2 openvino numpy wget" ] }, { "cell_type": "markdown", "id": "9976d32b-2a79-429c-b80d-8b2c5808d5c7", "metadata": {}, "source": [ "## Imports " ] }, { "cell_type": "code", "execution_count": 1, "id": "fc13ddac-9999-4a98-865b-955c796ff2b8", "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import os\n", "import glob\n", "import cv2\n", "import re\n", "import numpy as np\n", "import wget\n", "from openvino.runtime import Core" ] }, { "cell_type": "markdown", "id": "4b35a196-6696-44f0-9358-c9f8453dedf8", "metadata": {}, "source": [ "## Total-Text Dataset\n", "The Total-Text folder contains three folders: \n", " - Train\n", " - Test\n", " - Annotation" ] }, { "cell_type": "markdown", "id": "df9cd32c-1557-4e2c-a165-1ef27523c739", "metadata": {}, "source": [ "To add samples to the dataset, we are going to loop through the images in the Train and Test folders and use the ground_truth_polygonal annotations text files to get the ground truth bounding box detections and labels. There are other groundtruths available in the Annotation folder such as character level and text region mask which user can use based on their model evaluation. For the purpose of this notebook, we are looking at the horizontal bounding box text detection." 
] }, { "cell_type": "markdown", "id": "947e0b19-43df-4ce1-9170-c38eb98ed0ee", "metadata": {}, "source": [ "## Load dataset into FiftyOne" ] }, { "cell_type": "code", "execution_count": null, "id": "88b058e4-d9a9-4eca-a984-fd84dc8bada3", "metadata": {}, "outputs": [], "source": [ "# Create samples for your data\n", "samples = []\n", "for dataname in ['Train', 'Test']:\n", " \n", " # Looping through the Train and Test folder paths to add samples to the dataset\n", " images_patt = \"./Total-Text/\"+dataname+\"/*\"\n", " \n", " for filepath in glob.glob(images_patt):\n", " \n", " # Creating image samples with their respective sample tags, Train or Test \n", " # Tags gives you flexibility to use only samples present in Train or Test data\n", " sample = fo.Sample(filepath=filepath,tags=[dataname])\n", " \n", " # Get height, width of image\n", " img = cv2.imread(filepath, cv2.IMREAD_UNCHANGED)\n", " height = img.shape[0]\n", " width = img.shape[1]\n", " \n", " # Getting the filename from the filepath using the split operation\n", " # Ex: img1001 from Train folder image file path './Total-Text/Test\\\\img1.jpg'\n", " # Check the separator for the first split based on the OS\n", " filename=filepath.split(\"\\\\\")[-1].split(\".\")[0]\n", " \n", " # List of test images that are avoided due to incorrect formatting of their polygonal annnotations .txt file\n", " # The correct polygonal annotations format should be \n", " # \"x: [[153 161 179 195 184 177]], y: [[347 323 305 315 331 357]], ornt: [u'c'], transcriptions: [u'the']\\n\"\n", " test_images_to_avoid=['img551','img621','img623']\n", " \n", " if filename not in test_images_to_avoid:\n", " \n", " # Path to polygonal annotation text file\n", " annotation_path=\"./Total-Text/Annotation/groundtruth_polygonal_annotation/\"+dataname+\"/poly_gt_\"+filename+\".txt\"\n", " \n", " with open(annotation_path, \"r\") as f:\n", " \n", " # Each polygonal annotation text file is read line by line\n", " # For each line we try to extract key-value pairs using regular expressions python library\n", " # x and y are coordinates for bounding boxes, ornt is the orientation of the text\n", " # transcription gives us the text value detected\n", " polylines = [] \n", " lines = f.readlines()\n", " \n", " for line in lines:\n", " \n", " # Using the findall function of the re library we extract the values of \n", " # x, y, ornt, and transcriptions by pattern matching\n", " # For example, for x and y we are looking for number using \\d\n", " # for ornt and transcription we are matching with alphabets both\n", " # small and Capital as well as '#' in some cases \n", " x = re.findall(r'\\d+\\.\\d+|\\d+', line.split(',')[0])\n", " y = re.findall(r'\\d+\\.\\d+|\\d+', line.split(',')[1])\n", " \n", " # In case of ornt and transcription, we have an extra check where there are no\n", " # values or its empty\n", " if(len(re.findall(r'[a-z]+|\\#', line.split(',')[2])))==3:\n", " ornt = re.findall(r'[a-z]+|\\#', line.split(',')[2])[2]\n", " else:\n", " ornt = \"no_value\" \n", " if(len(re.findall(r'[A-Za-z]+|\\#|\\d+', line.split(',')[3])))==3:\n", " transcriptions = re.findall(r'[A-Za-z]+|\\#|\\d+', line.split(',')[3])[2]\n", " else:\n", " transcriptions = \"no_label\"\n", " \n", " # normalize x and y values between 0 and 1 using the image height and width\n", " x = [round(float(i)/width, 2) for i in x]\n", " y = [round(float(i)/height, 2) for i in y]\n", " \n", " # get in the format of lists of lists of tuples\n", " points = [list(zip(x, y))]\n", " \n", " # Create polyline label\n", 
" polyline=fo.Polyline(points=points,closed=True)\n", " # In case of evaluating model that detects text label, you can modify the above code\n", " # by adding label parameter to the fo.PolyLine method as label=transcriptions. \n", " # The other way around in the absence of labels for both groundtruths and model \n", " # predictions is to use an arbitrary label such as label='detected'\n", " \n", " polylines.append(polyline)\n", " \n", " # Adding polyline labels for each samples\n", " sample[\"ground_truth_polylines\"] = fo.Polylines(polylines=polylines)\n", " # Adding groundtruth labels with bounding box representation \n", " # tighly enclosing the polylines\n", " sample[\"ground_truth\"] = sample[\"ground_truth_polylines\"].to_detections()\n", " samples.append(sample)\n", " \n", "# Create dataset\n", "dataset = fo.Dataset(\"Total-Text-dataset-FO-1\")\n", "dataset.add_samples(samples) " ] }, { "cell_type": "markdown", "id": "eba8dc3f-c45b-44d3-852c-96615a7cf7a8", "metadata": {}, "source": [ "## Launch the fiftyone app to view the dataset" ] }, { "cell_type": "code", "execution_count": 4, "id": "db979a19-9293-482f-96bf-2b792c91b52c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session = fo.launch_app(dataset=dataset)" ] }, { "cell_type": "code", "execution_count": 5, "id": "dc297f8e-ef73-4929-8bb7-99e765918058", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: Total-Text-dataset-FO-1\n", "Media type: image\n", "Num samples: 1552\n", "Persistent: False\n", "Tags: []\n", "Sample fields:\n", " id: fiftyone.core.fields.ObjectIdField\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n", " ground_truth_polylines: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n" ] } ], "source": [ "# Print some information about the dataset\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 6, "id": "e27f0754-30ce-481b-bee4-74855f5480b2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Print a ground truth detection\n", "sample = dataset.first()\n", "print(sample.ground_truth.detections[0])" ] }, { "cell_type": "markdown", "id": "c2fc9393-4128-4a73-a586-ae0bbea61417", "metadata": {}, "source": [ "## Add predictions to dataset" ] }, { "cell_type": "code", "execution_count": 7, "id": "10785bd7-e37f-41b6-a651-cc0869983c40", "metadata": {}, "outputs": [], "source": [ "# predictions_view = dataset.match_tags([\"Train\", \"Test\"]) # for whole dataset\n", "predictions_view = dataset.match_tags([\"Test\"]) # for test dataset" ] }, { "cell_type": "markdown", "id": "f0229063-7242-4652-b026-066b5b6509f3", "metadata": {}, "source": [ "## Get the OpenVino Model\n", "\n", "An OpenVINO IR (Intermediate Representation) model consists of an .xml file, containing information about network topology, and a .bin file, containing the weights and biases binary data. The read_model() function expects the .bin weights file to have the same filename and be located in the same directory as the .xml file." 
] }, { "cell_type": "code", "execution_count": null, "id": "f6049521-3017-4b5d-90fa-b809f8a33895", "metadata": {}, "outputs": [], "source": [ "# Get the current working directory\n", "cwd = os.path.abspath(os.getcwd())\n", "\n", "# Create a new directory named 'model'\n", "os.mkdir(cwd+'/model')\n", "\n", "# Define the source urls\n", "model_xml_url = 'https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/main/notebooks/004-hello-detection/model/horizontal-text-detection-0001.xml'\n", "model_bin_url = 'https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/004-hello-detection/model/horizontal-text-detection-0001.bin?raw=true'\n", "\n", "# Define the destination file paths\n", "xml_path = cwd+'/model/horizontal-text-detection-0001.xml'\n", "bin_path = cwd+'/model/horizontal-text-detection-0001.bin'\n", "\n", "# Download the files to their respective paths\n", "wget.download(model_xml_url, out = xml_path)\n", "wget.download(model_bin_url, out = bin_path)" ] }, { "cell_type": "markdown", "id": "e2d2f415-abdb-4107-ad26-4619e18b74d6", "metadata": {}, "source": [ "### Let's load the OpenVino model" ] }, { "cell_type": "code", "execution_count": 8, "id": "3f74eebe-22b9-4c28-92f6-04201869f68b", "metadata": {}, "outputs": [], "source": [ "ie = Core()\n", "\n", "model = ie.read_model(model=\"./model/horizontal-text-detection-0001.xml\")\n", "compiled_model = ie.compile_model(model=model, device_name=\"CPU\")\n", "\n", "input_layer_ir = compiled_model.input(0)\n", "output_layer_ir = compiled_model.output(\"boxes\")" ] }, { "cell_type": "markdown", "id": "ed965601-8e04-4bc3-86ed-396a00646a61", "metadata": {}, "source": [ "### Add predictions to samples" ] }, { "cell_type": "code", "execution_count": 9, "id": "057eb4f7-de88-4b85-94f8-a72582f3c832", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100% |█████████████████| 297/297 [19.2s elapsed, 0s remaining, 15.9 samples/s] \n" ] } ], "source": [ "with fo.ProgressBar() as pb:\n", " for sample in pb(predictions_view):\n", " # Text detection models expect an image in BGR format.\n", " img = cv2.imread(sample.filepath)\n", "\n", " # height, width of image\n", " height = img.shape[0]\n", " width = img.shape[1]\n", " # N,C,H,W = batch size, number of channels, height, width.\n", " N, C, H, W = input_layer_ir.shape\n", "\n", " # Resize the image to meet network expected input sizes.\n", " resized_image = cv2.resize(img, (W, H))\n", "\n", " # Reshape to the network input shape.\n", " input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)\n", " \n", " # Create an inference request.\n", " boxes = compiled_model([input_image])[output_layer_ir]\n", "\n", " # Remove zero only boxes.\n", " boxes = boxes[~np.all(boxes == 0, axis=1)]\n", " \n", " # Getting the ratio of resized images and original image to avoid getting bounding boxes at wrong location\n", " (real_y, real_x), (resized_y, resized_x) = img.shape[:2], resized_image.shape[:2]\n", " ratio_x, ratio_y = real_x / resized_x, real_y / resized_y\n", "\n", " # Convert detections to FiftyOne format\n", " detections = []\n", " for i in range(len(boxes)):\n", " # Convert float to int and multiply corner position of each box by x and y ratio.\n", " # If the bounding box is found at the top of the image, \n", " # position the upper box bar little lower to make it visible on the image. 
\n", " (x1, y1, x2, y2) = [\n", " int(max(corner_position * ratio_y, 10)) if idx % 2 \n", " else int(corner_position * ratio_x)\n", " for idx, corner_position in enumerate(boxes[i][:-1])\n", " ]\n", " # Convert to [top-left-x, top-left-y, width, height]\n", " # in relative coordinates in [0, 1] x [0, 1]\n", " rel_box = [x1 / width, y1 / height, (x2 - x1) / width, (y2 - y1) / height]\n", " detections.append(\n", " fo.Detection( \n", " bounding_box=rel_box, \n", " confidence=boxes[i][4]\n", " )\n", " )\n", " sample[\"horizontal_detection\"] = fo.Detections(detections=detections)\n", " sample.save() " ] }, { "cell_type": "code", "execution_count": 10, "id": "a3b8f3de-1fce-4acb-a287-9a428e73d64d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.view = predictions_view" ] }, { "cell_type": "markdown", "id": "fd223be7-b059-4fcd-8c84-296e664d0e75", "metadata": {}, "source": [ "### Check the detections" ] }, { "cell_type": "code", "execution_count": 11, "id": "b76b5c94-a6d1-43f4-92c8-e23f623251f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.show()" ] }, { "cell_type": "markdown", "id": "311d393e-9d5f-4556-9312-4f090533e4aa", "metadata": {}, "source": [ "## Evaluate detections" ] }, { "cell_type": "code", "execution_count": 12, "id": "eb87932e-e764-49b6-a8da-af0265362a57", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluating detections...\n", " 100% |█████████████████| 297/297 [9.0s elapsed, 0s remaining, 39.4 samples/s] \n" ] } ], "source": [ "results = predictions_view.evaluate_detections(\n", " \"horizontal_detection\", gt_field=\"ground_truth\",use_boxes=True, classwise=False, eval_key=\"eval\"\n", ")" ] }, { "cell_type": "markdown", "id": "b83194b0-d02e-4da7-9129-611b6da30b91", "metadata": {}, "source": [ "The OpenVino's horizontal detection model only detects bounding box but return label for the text detected, therefore, while evaluating detection using `evaluate_detection` function, the `classwise` parameter is set to `False` and also ground_truth labels for text detected are not added to dataset. For a model that returns label for text detected, you can set `classwise` to `True` and add the ground truth labels to dataset." ] }, { "cell_type": "code", "execution_count": 13, "id": "4f2a5ff6-7711-477f-aa5f-4369473c14ec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset: Total-Text-dataset-FO-1\n", "Media type: image\n", "Num patches: 13422\n", "Patch fields:\n", " id: fiftyone.core.fields.ObjectIdField\n", " sample_id: fiftyone.core.fields.ObjectIdField\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n", " horizontal_detection: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n", " crowd: fiftyone.core.fields.BooleanField\n", " type: fiftyone.core.fields.StringField\n", " iou: fiftyone.core.fields.FloatField\n", "View stages:\n", " 1. 
ToEvaluationPatches(eval_key='eval', config=None)\n" ] } ], "source": [ "# Convert to evaluation patches\n", "eval_patches = dataset.to_evaluation_patches(\"eval\")\n", "print(eval_patches)" ] }, { "cell_type": "code", "execution_count": 14, "id": "8249c03a-ccfc-4b9a-98f6-95860cb02dc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{None: 10589, 'fp': 303, 'fn': 580, 'tp': 1950}\n" ] } ], "source": [ "print(eval_patches.count_values(\"type\"))" ] }, { "cell_type": "code", "execution_count": 15, "id": "b6770765-8727-43ac-ae2a-6932b523564f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# View patches in the App\n", "session.view = eval_patches" ] }, { "cell_type": "code", "execution_count": null, "id": "a024dc78-52c4-48ca-a2e5-83df2c1021db", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }