# Generate your synthetic document


```{figure} static/analog_doc_gen_pipeline.png
:width: 500px
```

Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatically generate degraded documents from a body of text.

In [None]:
from genalog.pipeline import AnalogDocumentGeneration


## Configurations

To use the pipeline, you will need to supply the following information:

### CSS Style Combinations

`STYLE_COMBINATIONS`: a dictionary defining the combinations of styles to generate per text document (i.e. one copy of the same text document is generated per style combination).



In [None]:
STYLE_COMBINATIONS = {
    "language": ["en_US"],
    "font_family": ["Segoe UI"],
    "font_size": ["12px"],
    "text_align": ["justify"],
    "hyphenate": [True],
}
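
To make the "one copy per style combination" idea concrete, the sketch below expands a dictionary shaped like `STYLE_COMBINATIONS` into individual style settings with `itertools.product`. It only illustrates the combinatorics and is not presented as genalog's internal code. The helper name `expand_styles` and the extra font values are assumptions made for this example.

```python
from itertools import product


def expand_styles(style_combinations):
    """Hypothetical helper: return one dict per combination of the listed values."""
    keys = list(style_combinations)
    value_lists = [style_combinations[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Assumed values, for illustration only: two fonts and two sizes.
example_styles = {
    "language": ["en_US"],
    "font_family": ["Times", "Arial"],
    "font_size": ["11px", "12px"],
    "text_align": ["justify"],
    "hyphenate": [True],
}

combinations = expand_styles(example_styles)
print(len(combinations))  # 1 * 2 * 2 * 1 * 1 = 4
for style in combinations:
    print(style)
```

With single-element lists, as in the configuration above, the expansion yields exactly one style, so one document is generated per text file.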

```{note}
Genalog depends on WeasyPrint as the engine to render these CSS styles. Most of these fields are standard CSS properties and accept common values as specified in [W3C CSS Properties](https://www.w3.org/Style/CSS/all-properties.en.html). For details, please see the [WeasyPrint documentation](https://weasyprint.readthedocs.io/en/stable/features.html#fonts).
```

### Choose a Prebuilt HTML Template

`HTML_TEMPLATE`: the name of the HTML template used to generate the synthetic images. The `genalog` package ships with the following default templates:

````{tab} columns.html.jinja
```{figure} static/columns_Times_11px.png
:width: 30%
Document template with 2 columns 
```
````
````{tab} letter.html.jinja
```{figure} static/letter_Times_11px.png
:width: 30%
Letter-like document template
```
````
````{tab} text_block.html.jinja
```{figure} static/text_block_Times_11px.png
:width: 30%
Simple text block template
```
````

In [None]:
HTML_TEMPLATE = "text_block.html.jinja"


### Image Degradations

`DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.


````{tab} bleed_through
```{figure} static/bleed_through.png
:name: Bleed-through
:width: 90%
Mimics a document printed on two sides. Valid values: [0,1].
```
````
````{tab} blur
```{figure} static/blur.png
:name: Blur
:width: 90%
Lowers image quality. Units are in pixels.
```
````
````{tab} salt/pepper
```{figure} static/salt_pepper.png
:name: Salt/Pepper
:width: 65%
Mimics ink degradation. Valid values: [0, 1].
```
````
`````{tab} close/dilate
```{figure} static/close_dilate.png
:name: Close/Dilate
Degrades printing quality.
```
````{margin}
```{note}
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
```
````
`````
`````{tab} open/erode
```{figure} static/open_erode.png
:name: Open/Errode
Ink overflows
```
````{margin}
```{note}
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
```
````
`````

In [None]:
from genalog.degradation.degrader import ImageState

DEGRADATIONS = [
    ("blur", {"radius": 5}),
    ("bleed_through", {
        "src": ImageState.CURRENT_STATE,
        "background": ImageState.ORIGINAL_STATE,
        "alpha": 0.8,
        "offset_x": -6,
        "offset_y": -12,
    }),
    ("morphology", {"operation": "open", "kernel_shape": (9, 9), "kernel_type": "plus"}),
    ("pepper", {"amount": 0.005}),
    ("salt", {"amount": 0.15}),
]

```{note}
`ImageState.ORIGINAL_STATE` refers to the original state of the image before any degradation is applied, while
`ImageState.CURRENT_STATE` refers to the state of the image after the most recently applied degradation effect.
```

The example above will apply degradation effects to synthetic images in the sequence of: 
 
 blur -> bleed_through -> morphological operation (open) -> pepper -> salt
 
For the full list of supported degradation effects, please see [documentation on degradation](https://github.com/microsoft/genalog/blob/main/genalog/degradation/README.md).
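
The `(name, kwargs)` structure can be read as a small dispatch table: each name selects a function, the dictionary supplies its keyword arguments, and the effects run in list order. The sketch below mimics that flow on a toy list of pixel values. The functions `brighten` and `invert`, the `TOY_EFFECTS` registry, and `apply_in_sequence` are hypothetical stand-ins for illustration and are not part of `genalog.degradation.effect`.

```python
def brighten(pixels, amount=0):
    """Toy effect: add `amount` to every pixel, clipped at 255."""
    return [min(255, p + amount) for p in pixels]


def invert(pixels):
    """Toy effect: flip dark and light values."""
    return [255 - p for p in pixels]


# Hypothetical registry mapping effect names to functions.
TOY_EFFECTS = {"brighten": brighten, "invert": invert}


def apply_in_sequence(pixels, degradations):
    """Apply each (name, kwargs) pair in order, feeding each output to the next effect."""
    current = list(pixels)
    for name, kwargs in degradations:
        current = TOY_EFFECTS[name](current, **kwargs)
    return current


toy_degradations = [("brighten", {"amount": 50}), ("invert", {})]
result = apply_in_sequence([0, 100, 250], toy_degradations)
print(result)  # [205, 105, 0]
```

Because each effect receives the output of the previous one, reordering the list generally changes the final image.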

We use `Jinja` to prepare the HTML templates. You can find examples of these Jinja templates in [our source code](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates).

## Document Generation

With the above configurations, we can start generating synthetic documents.

### Load Sample Text Content

You can use **any** text documents as the content of the generated images. For the sake of the tutorial, you can use the [sample text](https://github.com/microsoft/genalog/blob/main/example/sample/generation/example.txt) from our repo.

In [None]:
import requests

sample_text_url = "https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt"
sample_text = "example.txt"

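r = requests.get(sample_text_url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

# write the downloaded bytes to disk; the context manager closes the file
with open(sample_text, "wb") as f:
    f.write(r.content)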


### Generate Synthetic Documents

Next, we can supply the style and degradation configurations when initializing an `AnalogDocumentGeneration` object:

In [None]:
from genalog.pipeline import AnalogDocumentGeneration

IMG_RESOLUTION = 300 # dots per inch (dpi) of the generated pdf/image

doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION, template_path=None)

To use custom templates, set `template_path` to the folder containing them. You can find more information in our [`document_generation.ipynb`](https://github.com/microsoft/genalog/blob/main/example/document_generation.ipynb).

Once initialized, you can call the `generate_img()` method to get the synthetic document as an image:

In [None]:
# for custom templates, please set template_path.
img_array = doc_generation.generate_img(sample_text, HTML_TEMPLATE, target_folder=None) # returns the raw image bytes if target_folder is not specified

```{note}
Setting `target_folder` to `None` returns the raw image data as a `numpy.ndarray`. Otherwise, the generated image is saved to disk as a PNG file in the specified folder.
```

### Display the Document

In [None]:
import cv2
from IPython.display import Image, display

_, encoded_image = cv2.imencode('.png', img_array)
display(Image(data=encoded_image.tobytes(), width=600))  # pass the PNG-encoded buffer as bytes

## Document Generation (Multi-process)

To scale up generation across multiple text files, you can use `generate_dataset_multiprocess`. The method splits the list of text filenames into batches and runs document generation on the batches in parallel subprocesses.

In [None]:
from genalog.pipeline import generate_dataset_multiprocess

DST_PATH = "data" # where on disk to write the generated image

generate_dataset_multiprocess(
 [sample_text], DST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, 
 resolution=IMG_RESOLUTION, batch_size=5
)

```{note}
`[sample_text]` is a list of filenames to generate the synthetic dataset over.
```
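
The batching described above can be pictured with a small sketch that splits a list of filenames into chunks of `batch_size`. It shows the general idea only. How genalog partitions and schedules the work internally is not covered here, and both `split_into_batches` and the file names are hypothetical.

```python
def split_into_batches(filenames, batch_size):
    """Hypothetical helper: chunk a list of filenames into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [filenames[i:i + batch_size] for i in range(0, len(filenames), batch_size)]


# Assumed file names, for illustration only.
text_files = [f"doc_{i}.txt" for i in range(12)]
batches = split_into_batches(text_files, batch_size=5)
print([len(batch) for batch in batches])  # [5, 5, 2]
```

With a single input file, as in `[sample_text]` above, there is only one batch, so the benefit of parallelism shows up only on larger file lists.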