{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "orig_nbformat": 2, "kernelspec": { "name": "python3", "display_name": "Python 3.6.9 64-bit ('.env': venv)" }, "metadata": { "interpreter": { "hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f" } }, "interpreter": { "hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f" } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Generate your synthetic document\n", "\n", "\n", "```{figure} static/analog_doc_gen_pipeline.png\n", ":width: 500px\n", "```\n", "\n", "Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatic generate documents with degradation from a body of text." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from genalog.pipeline import AnalogDocumentGeneration\n" ] }, { "source": [ "## Configurations\n", "\n", "To use the pipeline, you will need to supply the following information:\n", "\n", "### CSS Style Combinations\n", "\n", "`STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generate per style combination)\n", "\n" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "STYLE_COMBINATIONS = {\n", " \"language\": [\"en_US\"],\n", " \"font_family\": [\"Segeo UI\"],\n", " \"font_size\": [\"12px\"],\n", " \"text_align\": [\"justify\"],\n", " \"hyphenate\": [True],\n", "}" ] }, { "source": [ "```{note}\n", "Genalog depends on Weasyprint as the engine to render these CSS styles. Most of these fields are standard CSS properties and accepts common values as specified in [W3C CSS Properties](https://www.w3.org/Style/CSS/all-properties.en.html). For details, please see [Weasyprint Documentation](https://weasyprint.readthedocs.io/en/stable/features.html#fonts).\n", "```" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Choose a Prebuild HTML Template\n", "\n", "`HTML_TEMPLATE`: name of html template used to generate the synthetic images. The `genalog` package has the following default templates: \n", "\n", "````{tab} columns.html.jinja\n", "```{figure} static/columns_Times_11px.png\n", ":width: 30%\n", "Document template with 2 columns \n", "```\n", "````\n", "````{tab} letter.html.jinja\n", "```{figure} static/letter_Times_11px.png\n", ":width: 30%\n", "Letter-like document template\n", "```\n", "````\n", "````{tab} text_block.html.jinja\n", "```{figure} static/text_block_Times_11px.png\n", ":width: 30%\n", "Simple text block template\n", "```\n", "````" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HTML_TEMPLATE = \"text_block.html.jinja\"\n" ] }, { "source": [ "### Image Degradations\n", "\n", "`DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "````{tab} bleed_through\n", "```{figure} static/bleed_through.png\n", ":name: Bleed-through\n", ":width: 90%\n", "Mimics a document printed on two sides. Valid values: [0,1].\n", "```\n", "````\n", "````{tab} blur\n", "```{figure} static/blur.png\n", ":name: Blur\n", ":width: 90%\n", "Lowers image quality. Unit are in number of pixels.\n", "```\n", "````\n", "````{tab} salt/pepper\n", "```{figure} static/salt_pepper.png\n", ":name: Salt/Pepper\n", ":width: 65%\n", "Mimics ink degradation. Valid values: [0, 1].\n", "```\n", "````\n", "`````{tab} close/dilate\n", "```{figure} static/close_dilate.png\n", ":name: Close/Dilate\n", "Degrades printing quality.\n", "```\n", "````{margin}\n", "```{note}\n", "For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n", "```\n", "````\n", "`````\n", "`````{tab} open/erode\n", "```{figure} static/open_erode.png\n", ":name: Open/Errode\n", "Ink overflows\n", "```\n", "````{margin}\n", "```{note}\n", "For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n", "```\n", "````\n", "`````" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from genalog.degradation.degrader import ImageState\n", "\n", "DEGRADATIONS = [\n", " (\"blur\", {\"radius\": 5}),\n", " (\"bleed_through\", {\n", " \"src\": ImageState.CURRENT_STATE,\n", " \"background\": ImageState.ORIGINAL_STATE,\n", " \"alpha\": 0.8,\n", " \"offset_x\": -6,\n", " \"offset_y\": -12,\n", " }),\n", " (\"morphology\", {\"operation\": \"open\", \"kernel_shape\":(9,9), \"kernel_type\":\"plus\"}),\n", " (\"pepper\", {\"amount\": 0.005}),\n", " (\"salt\", {\"amount\": 0.15}),\n", "]" ] }, { "source": [ "```{note}\n", "`ImageState.ORIGINAL_STATE` refers to the origin state of the image before applying any degradation, while\n", "`ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.\n", "```\n", "\n", "The example above will apply degradation effects to synthetic images in the sequence of: \n", " \n", " blur -> bleed_through -> morphological operation (open) -> pepper -> salt\n", " \n", "For the full list of supported degradation effects, please see [documentation on degradation](https://github.com/microsoft/genalog/blob/main/genalog/degradation/README.md)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "We use `Jinja` to prepare html templates. You can find example of these Jinja templates in [our source code](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Document Generation\n", "\n", "With the above configurations, we can go ahead and start generate synthetic document.\n", "\n", "### Load Sample Text content\n", "\n", "You can use **any** text documents as the content of the generated images. For the sake of the tutorial, you can use the [sample text](https://github.com/microsoft/genalog/blob/main/example/sample/generation/example.txt) from our repo." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "sample_text_url = \"https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt\"\n", "sample_text = \"example.txt\"\n", "\n", "r = requests.get(sample_text_url, allow_redirects=True)\n", "open(sample_text, 'wb').write(r.content)\n" ] }, { "source": [ "### Generate Synthetic Documents\n", "\n", "Next, we can supply the three aforementioned configurations in initalizing `AnalogDocumentGeneration` object" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from genalog.pipeline import AnalogDocumentGeneration\n", "\n", "IMG_RESOLUTION = 300 # dots per inch (dpi) of the generated pdf/image\n", "\n", "doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION, template_path=None)" ] }, { "source": [ "To use custom templates, please set `template_path` to the folder of containing them. You can find more information from our [`document_generation.ipynb`](https://github.com/microsoft/genalog/blob/main/example/document_generation.ipynb).\n", "\n", "Once initialized, you can call `generate_img()` method to get the synthetic documents as images" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for custom templates, please set template_path.\n", "img_array = doc_generation.generate_img(sample_text, HTML_TEMPLATE, target_folder=None) # returns the raw image bytes if target_folder is not specified" ] }, { "source": [ "```{note}\n", "Setting `target_folder` to `None` will return the raw image bytes as a `Numpy.ndarray`. Otherwise the generated image will be save on the disk as a PNG file in the specified path.\n", "```" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Display the Document" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cv2\n", "from IPython.core.display import Image, display\n", "\n", "_, encoded_image = cv2.imencode('.png', img_array)\n", "display(Image(data=encoded_image, width=600))" ] }, { "source": [ "## Document Generation (Multi-process)\n", "\n", "To scale up the generation across multiple text files, you can use `generate_dataset_multiprocess`. The method will split the list of text filenames into batches and run document generation across different batches as subprocesses in parallel" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from genalog.pipeline import generate_dataset_multiprocess\n", "\n", "DST_PATH = \"data\" # where on disk to write the generated image\n", "\n", "generate_dataset_multiprocess(\n", " [sample_text], DST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, \n", " resolution=IMG_RESOLUTION, batch_size=5\n", ")" ] }, { "source": [ "```{note}\n", "`[sample_text]` is a list of filenames to generate the synthetic dataset over.\n", "```" ], "cell_type": "markdown", "metadata": {} } ] }