{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.6.9 64-bit ('.env': venv)"
  },
  "metadata": {
   "interpreter": {
    "hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
   }
  },
  "interpreter": {
   "hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Generate your synthetic document\n",
    "\n",
    "\n",
    "```{figure} static/analog_doc_gen_pipeline.png\n",
    ":width: 500px\n",
    "```\n",
    "\n",
    "Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatic generate documents with degradation from a body of text."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from genalog.pipeline import AnalogDocumentGeneration\n"
   ]
  },
  {
   "source": [
    "## Configurations\n",
    "\n",
    "To use the pipeline, you will need to supply the following information:\n",
    "\n",
    "### CSS Style Combinations\n",
    "\n",
    "`STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generate per style combination)\n",
    "\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "STYLE_COMBINATIONS = {\n",
    "    \"language\": [\"en_US\"],\n",
    "     \"font_family\": [\"Segeo UI\"],\n",
    "     \"font_size\": [\"12px\"],\n",
    "     \"text_align\": [\"justify\"],\n",
    "     \"hyphenate\": [True],\n",
    "}"
   ]
  },
  {
   "source": [
    "```{note}\n",
    "Genalog depends on Weasyprint as the engine to render these CSS styles. Most of these fields are standard CSS properties and accepts common values as specified in [W3C CSS Properties](https://www.w3.org/Style/CSS/all-properties.en.html). For details, please see [Weasyprint Documentation](https://weasyprint.readthedocs.io/en/stable/features.html#fonts).\n",
    "```"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### Choose a Prebuild HTML Template\n",
    "\n",
    "`HTML_TEMPLATE`: name of html template used to generate the synthetic images. The `genalog` package has the following default templates: \n",
    "\n",
    "````{tab} columns.html.jinja\n",
    "```{figure} static/columns_Times_11px.png\n",
    ":width: 30%\n",
    "Document template with 2 columns \n",
    "```\n",
    "````\n",
    "````{tab} letter.html.jinja\n",
    "```{figure} static/letter_Times_11px.png\n",
    ":width: 30%\n",
    "Letter-like document template\n",
    "```\n",
    "````\n",
    "````{tab} text_block.html.jinja\n",
    "```{figure} static/text_block_Times_11px.png\n",
    ":width: 30%\n",
    "Simple text block template\n",
    "```\n",
    "````"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "HTML_TEMPLATE = \"text_block.html.jinja\"\n"
   ]
  },
  {
   "source": [
    "### Image Degradations\n",
    "\n",
    "`DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from  `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "````{tab} bleed_through\n",
    "```{figure} static/bleed_through.png\n",
    ":name: Bleed-through\n",
    ":width: 90%\n",
    "Mimics a document printed on two sides. Valid values: [0,1].\n",
    "```\n",
    "````\n",
    "````{tab} blur\n",
    "```{figure} static/blur.png\n",
    ":name: Blur\n",
    ":width: 90%\n",
    "Lowers image quality. Unit are in number of pixels.\n",
    "```\n",
    "````\n",
    "````{tab} salt/pepper\n",
    "```{figure} static/salt_pepper.png\n",
    ":name: Salt/Pepper\n",
    ":width: 65%\n",
    "Mimics ink degradation. Valid values: [0, 1].\n",
    "```\n",
    "````\n",
    "`````{tab} close/dilate\n",
    "```{figure} static/close_dilate.png\n",
    ":name: Close/Dilate\n",
    "Degrades printing quality.\n",
    "```\n",
    "````{margin}\n",
    "```{note}\n",
    "For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
    "```\n",
    "````\n",
    "`````\n",
    "`````{tab} open/erode\n",
    "```{figure} static/open_erode.png\n",
    ":name: Open/Errode\n",
    "Ink overflows\n",
    "```\n",
    "````{margin}\n",
    "```{note}\n",
    "For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
    "```\n",
    "````\n",
    "`````"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from genalog.degradation.degrader import ImageState\n",
    "\n",
    "DEGRADATIONS = [\n",
    "    (\"blur\", {\"radius\": 5}),\n",
    "    (\"bleed_through\", {\n",
    "        \"src\": ImageState.CURRENT_STATE,\n",
    "        \"background\": ImageState.ORIGINAL_STATE,\n",
    "        \"alpha\": 0.8,\n",
    "        \"offset_x\": -6,\n",
    "        \"offset_y\": -12,\n",
    "    }),\n",
    "    (\"morphology\", {\"operation\": \"open\", \"kernel_shape\":(9,9), \"kernel_type\":\"plus\"}),\n",
    "    (\"pepper\", {\"amount\": 0.005}),\n",
    "    (\"salt\", {\"amount\": 0.15}),\n",
    "]"
   ]
  },
  {
   "source": [
    "```{note}\n",
    "`ImageState.ORIGINAL_STATE` refers to the origin state of the image before applying any degradation, while\n",
    "`ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.\n",
    "```\n",
    "\n",
    "The example above will apply degradation effects to synthetic images in the sequence of: \n",
    "    \n",
    "        blur -> bleed_through -> morphological operation (open) -> pepper -> salt\n",
    "    \n",
    "For the full list of supported degradation effects, please see [documentation on degradation](https://github.com/microsoft/genalog/blob/main/genalog/degradation/README.md)."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "We use `Jinja` to prepare html templates. You can find example of these Jinja templates in [our source code](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates)."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Document Generation\n",
    "\n",
    "With the above configurations, we can go ahead and start generate synthetic document.\n",
    "\n",
    "### Load Sample Text content\n",
    "\n",
    "You can use **any** text documents as the content of the generated images. For the sake of the tutorial, you can use the [sample text](https://github.com/microsoft/genalog/blob/main/example/sample/generation/example.txt) from our repo."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "sample_text_url = \"https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt\"\n",
    "sample_text = \"example.txt\"\n",
    "\n",
    "r = requests.get(sample_text_url, allow_redirects=True)\n",
    "open(sample_text, 'wb').write(r.content)\n"
   ]
  },
  {
   "source": [
    "### Generate Synthetic Documents\n",
    "\n",
    "Next, we can supply the three aforementioned configurations in initalizing `AnalogDocumentGeneration` object"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from genalog.pipeline import AnalogDocumentGeneration\n",
    "\n",
    "IMG_RESOLUTION = 300 # dots per inch (dpi) of the generated pdf/image\n",
    "\n",
    "doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION, template_path=None)"
   ]
  },
  {
   "source": [
    "To use custom templates, please set `template_path` to the folder of containing them. You can find more information from our [`document_generation.ipynb`](https://github.com/microsoft/genalog/blob/main/example/document_generation.ipynb).\n",
    "\n",
    "Once initialized, you can call `generate_img()` method to get the synthetic documents as images"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for custom templates, please set template_path.\n",
    "img_array = doc_generation.generate_img(sample_text, HTML_TEMPLATE, target_folder=None) # returns the raw image bytes if target_folder is not specified"
   ]
  },
  {
   "source": [
    "```{note}\n",
    "Setting `target_folder` to `None` will return the raw image bytes as a `Numpy.ndarray`. Otherwise the generated image will be save on the disk as a PNG file in the specified path.\n",
    "```"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### Display the Document"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cv2\n",
    "from IPython.core.display import Image, display\n",
    "\n",
    "_, encoded_image = cv2.imencode('.png', img_array)\n",
    "display(Image(data=encoded_image, width=600))"
   ]
  },
  {
   "source": [
    "## Document Generation (Multi-process)\n",
    "\n",
    "To scale up the generation across multiple text files, you can use `generate_dataset_multiprocess`. The method will split the list of text filenames into batches and run document generation across different batches as subprocesses in parallel"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from genalog.pipeline import generate_dataset_multiprocess\n",
    "\n",
    "DST_PATH = \"data\" # where on disk to write the generated image\n",
    "\n",
    "generate_dataset_multiprocess(\n",
    "    [sample_text], DST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, \n",
    "    resolution=IMG_RESOLUTION, batch_size=5\n",
    ")"
   ]
  },
  {
   "source": [
    "```{note}\n",
    "`[sample_text]` is a list of filenames to generate the synthetic dataset over.\n",
    "```"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}