# Configuration file specification

Datanymizer uses a configuration file (`config.yml`) to determine what data to dump and how to anonymize it.

A config example (for the Postgres demo database [DVD Rental](https://sp.postgresqltutorial.com/wp-content/uploads/2019/05/dvdrental.zip)):

```yaml
tables:
  - name: actor
    rules:
      first_name:
        # random name
        first_name: {}
      last_name:
        # random surname
        last_name: {}
      last_update:
        # random date
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
    query:
      # keeping data of the actor Jane Jackman unanonymized
      transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
      # not dumping the actor with actor_id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: address
    rules:
      address:
        # using template
        template:
          # using transformed (anonymized) value of district
          format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
          rules:
            # random street name
            - street_name: {}
            # random building number
            - building_number: {}
      address2:
        # using the template engine (Tera, it is very similar to Jinja) features: condition and built-in function:
        # we add an address comment to roughly half of the rows
        # the template engine is very agile
        template:
          format: "{% if get_random(start=1, end=2) == 1 %}Comment: {{ _1 }}{% endif %}"
          rules:
            # lorem ipsum words (the number of words is 1-2)
            - words:
                min: 1
                max: 2
      district:
        template:
          format: "{{ _1 }}, {{ _2 }}"
          rules:
            # nested template
            - template:
                format: "{{ _2 }} ({{ _1 }})"
                rules:
                  # random country code
                  - country_code: {}
                  # random state abbreviation
                  - state_abbr: {}
            - template:
                format: "dst"
      phone:
        # random phone with some format
        phone:
          format: "7900#######"
          # phones will be unique
          uniq: true
      postal_code:
        # random postal code
        post_code: {}
    # you must specify the order of rule execution when using `final`
    rule_order:
      - address

  - name: city
    rules:
      city:
        city: {}

  - name: customer
    rules:
      active:
        # using anonymized `activebool` value
        template:
          format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
      activebool:
        # the probability of `true` is 80%
        boolean:
          ratio: 80
      create_date:
        datetime:
          from: 2000-01-01T00:00:00+00:00
          to: 2020-12-31T00:00:00+00:00
      email:
        # using the original first name value in the anonymized email
        # also using the anonymized value of `active`
        template:
          format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
          rules:
            # random email
            - email: {}
      last_name:
        # using of original value (keep the first letter of the last name)
        template:
          format: "{{ _0 | truncate(length=1) }}"
    rule_order:
      - active
      - email

  - name: film
    rules:
      fulltext:
        # no transformation
        none: ~
      length:
        # random number
        random_num:
          min: 50
          max: 200
      rating:
        pipeline:
          # using pipelines
          pipes:
            - template:
                format: "r"
            - capitalize: ~

  - name: film_actor
    rules: {}
    query:
      # not dumping the actor with id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: payment
    rules:
      amount:
        # using the value from globals
        template:
          format: "{{ prev.amount | float * payment_k }}"

  - name: staff
    rules:
      email:
        email: {}
      username:
        template:
          # using the values from globals and template variables
          format: "{{ global_value }}.{{ template_var }}.{{ _1 }}"
          rules:
            # random number
            - random_num:
                min: 100
                max: 999
          variables:
            template_var: "tv456"
      password:
        # random hex token
        hex_token:
          len: 40

default:
  locale: EN

# some global variables (they are available in templates)
globals:
  global_value: "gv123"
  payment_k: 1.73
```

The config file contains following sections:

| Section                     | Mandatory | YAML type  | Description
|---                          |---        |---         |---
| [tables](#tables)           | yes       | list       | A list of anonymized tables
| [table_order](#table_order) | no        | list       | An order of table dumping
| [default](#default)         | no        | dictionary | Default values for different anonymization rules
| [filter](#filter)           | no        | dictionary | A filter for tables schema and data (what to skip when dumping)
| [globals](#globals)         | no        | dictionary | Some global values (they are available in anonymization templates)

## tables

The `tables` section is a list of anonymized [tables](#table).
This is a main element of the config.

Example (there are anonymization rules for two database tables: `actor` and `address`):

```yaml
tables:
  - name: actor
    rules:
      first_name:
        # random name
        first_name: {}
      last_name:
        # random surname
        last_name: {}
      last_update:
        # random date
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
    query:
      # keeping data of the actor Jane Jackman unanonymized
      transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
      # not dumping the actor with actor_id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: address
    rules:
      address:
        # using template
        template:
          # using transformed (anonymized) value of district
          format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
          rules:
            # random street name
            - street_name: {}
            # random building number
            - building_number: {}
```

### table

| Section                   | Mandatory | YAML type  | Description
|---                        |---        |---         |---
| `name`                    | yes       | text       | The table name in the database
| [rules](#rules)           | yes       | dictionary | Anonymization rules for this table (the column names are the dictionary keys)
| [rule_order](#rule_order) | no        | list       | An order of rule execution
| [query](#query)           | no        | dictionary | Conditions for SQL queries for dumping data 

You can use table names with schema (e.g. `public.users`) or without it (just `users`). In the latter case, this means
that the rules will be applied to the `users` table in any schema.

#### rules

Anonymization rules (we call them `transformers`) for the table columns.

Dictionary keys are the column names.
Each value contains an anonymizing configuration for column (a name of transformer - an address, a company name, a person name, 
some template, etc, with its options).

| Rule (transformer)             | Description                                                                   |
|--------------------------------|------------------------------------------------------------------------------ |
| `email`                        | Emails with different options                                                 |
| `ip`                           | IP addresses. Supports IPv4 and IPv6                                          |
| `words`                        | Lorem words with different length                                             |
| `first_name`                   | First name generator                                                          |
| `last_name`                    | Last name generator                                                           |
| `city`                         | City names generator                                                          |
| `phone`                        | Generate random phone with different `format`                                 |
| `pipeline`                     | Use pipeline to generate more complicated values                              |
| `capitalize`                   | Like filter, it capitalizes input value                                       |
| `template`                     | Template engine for generate random text with included rules                  |
| `digit`                        | Random digit (in range `0..9`), localized                                               |
| `random_num`                | Random number with `min` and `max` options                                    |
| `password`                     | Password with different length options<br> (supports `max` and `min` options) |
| `datetime`                     | Make DateTime strings with options (`from` and `to`)                          |
| more than 70 rules in total... |                                                                               |

For the complete list of rules please refer [this document](transformers.md).

**Some transformer examples:**

##### first_name

It gets a person first name.

Examples:

The default:

```yaml
rules:
  field_name:
    first_name: {}
```

You can configure locale:

```yaml
rules:
  field_name:
    first_name:
      locale: RU
```

##### phone

It gets a random phone number.

Examples:

The default:

```yaml
rules:
  field_name:
    phone: {}      
```

You can specify the phone format:

```yaml
rules:
  field_name:
    phone:
      format: "+7^#########"
```

where: 
* `#` - any digit from 0 to 9
* `^` - any digit from 1 to 9
  
Also, you can use any other symbols in format: `^##-00-### (##-##)`.

The default format is `+###########`.

If you want to generate unique phone numbers for this database column, use the `uniq` option:

```yaml
rules:
  field_name:
    phone:
      uniq: true
```

The transformer will collect information about generated numbers and check their uniqueness.
If such a number already exists in the list, then the transformer will try to generate the value again.
The number of attempts is limited by the number of available invariants based on the format.

##### random_num

Gets a random number.

Examples:

The default:

```yaml
rules:
  field_name:
    random_num: {}
```

You can specify a range (one border or both):

```yaml
rules:
  field_name:
    random_num:
      min: 10
      max: 20
```

The default range is from `0` to `2^64 - 1` (for 64-bit application binary).

If you want to generate unique numbers, use this option:

```yaml
  rules:
    field_name:
      random_num:
        uniq: true
```

The transformer will collect information about generated numbers and check their uniqueness.
If such a number already exists in the list, then the transformer will try to generate the value again.
You can limit the number of tries (the default is 3):

```yaml
rules:
  field_name:
    random_num:
      uniq:
        required: true
        try_count: 5
```

##### template

This is the most sophisticated and flexible transformer.

It uses the [Tera](https://tera.netlify.app) template engine
(inspired by [Jinja2](https://jinja.palletsprojects.com)).

Specification:

| Section                   | Mandatory | YAML type  | Description
|---                        |---        |---         |---
| `format`                  | yes       | text       | The template for generated value
| `rules`                   | no        | list       | Nested rules (transformers). You can use them in the template
| `variables`               | no        | dictionary | Template variables

Examples:

```yaml
rules:
  field_name:
    template:
      format: "Hello, {{name}}! {{_1}}:{{_0 | upper}}"
      rules:
        - email: {}    
      variables:
        name: Alex
```

where:
* `_0` - transformed value (original);
* `_1`, `_2`, ... `_N` - nested rules by index (started from 1). You can use any transformer (including templates); 
* `name` - the named variable from the `variables` section.

It will generate something like `Hello, Alex! some-fake-email@gmail.com:ORIGINALVALUE`.

You can use any filter or markup from the Tera template engine.

Also, you can use the [global](#globals) variables in templates.

You can reference values of other row fields in templates.
Use the `prev` special variable for original values and the `final` special variable - for anonymized:

```yaml
tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name   
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again   
          format: "{greeting: \"{{ final.greeting }}\"}"
```

You must specify the order of rule execution when using `final` with [rule_order](#rule_order).
All rules not listed will be placed at the beginning (i.e., you must list only rules with `final`).

#### rule_order

A list of columns that will be processed in the specified order (after all columns that are not in the list). 
The order of execution for other columns is not guaranteed. 

Look at this table configuration example:

```yaml
name: customer
rules:
  active:
    # using anonymized `activebool` value
    template:
      format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
  activebool:
    # the probability of `true` is 80%
    boolean:
      ratio: 80
  create_date:
    datetime:
      from: 2000-01-01T00:00:00+00:00
      to: 2020-12-31T00:00:00+00:00
  email:
    # using the original first name value in the anonymized email
    # also using the anonymized value of `active`
    template:
      format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
      rules:
        # random email
        - email: {}
  last_name:
    # using of original value (keep the first letter of the last name)
    template:
      format: "{{ _0 | truncate(length=1) }}"
rule_order:
  - active
  - email
```

The order of column processing will be as follows:

1. `activebool`, `create_date`, `last_name` (the exact order is not guaranteed)
2. `active`
3. `email`

_You only need the `rule_order` section when using the `template` transformer with the `final` special template variable._ 

For additional information please refer to the [template](transformers.md#template) transformer documentation.

#### query

| Section               | Mandatory | YAML type | Description
|---                    |---        |---        |---
| `dump_condition`      | no        | text      | SQL `WHERE` statement for dumped data
| `limit`               | no        | integer   | SQL `LIMIT` for dumped data
| `transform_condition` | no        | text      | SQL `WHERE` statement for anonymizing data

You can specify conditions (SQL `WHERE` statement) and limit for dumped data from the table:

```yaml
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100 
```

As the additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):

```yaml
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"      
      # select maximum 100 rows
      limit: 100
```

You can use the `dump_condition`, `transform_condition` and `limit` options in any combination (only
`transform_condition`; `transform_condition` and `limit`; etc).

If you don't need data from a particular table at all, please refer to the [filter](#filter) section.

## table_order

A list of tables that will be dumped in the specified order (after all tables that are not in the list).
The order of execution for other tables depends on foreign keys.

Look at this configuration example:

```yaml
tables:
  - name: "table1"
    rules: {}
  - name: "table2"
    rules: {}
  - name: "table3"
    rules: {}    
table_order:
  - "table1"
  - "table2"
```

The order of table dumping will be as follows:

1. `table3`
2. `table1`
3. `table2`

You may need this section when using the built-in key-value store in the `template` transformer for sharing data between
tables.

For additional information please refer to the [template](transformers.md#template) transformer documentation.

## default

| Section       | Mandatory | YAML type | Description
|---            |---        |---        |---
| `locale`      | no        | text      | The default locale for transformers

Supported locales are `EN` (the default one), `ZH_TW` (traditional chinese) and `RU` (translation in progress).
We plan to support more locales in the future.

You can override the locale for each transformer (rule) in its options. Some transformers are not affected by locale.

Example:

```yaml
default:
  locale: RU
```

## filter

You can specify which tables you choose (whitelisting) or ignore (blacklisting) to dump.

You must use the full table names here (with schema). 

You can use wildcards: 

* `?` matches exactly one occurrence of any character;
* `*` matches arbitrary many (including zero) occurrences of any character.


### Examples

For dumping only `public.markets` and `public.users` data:

```yaml
filter:
  only:
    - public.markets
    - public.users
```

For ignoring these tables and dump data from others:

```yaml
filter:
  except:
    - public.markets
    - public.users
```

You can also specify data and schema filters separately.

This is equivalent to the previous example:

```yaml
filter:
  data:
    except:
      - public.markets
      - public.users
```

For skipping schema and data from other tables:

```yaml
filter:
  schema:
    only:
      - public.markets
      - public.users
```

For skipping schema for `markets` table and dumping data only from `users` table:

```yaml
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets
```

For skipping schema and data from all tables in the schema `other` (you should use the quotes):

```yaml
filter:
  schema:
    except:
      - "other.*"
```

For dumping data only from `public.table1`, `public.table2`, `public.table3`, etc:

```yaml
filter:
  - "public.table?"
```

If you need only a subset of the data, please refer to the [query](#query) section.

## templates
You can specify some templates in config to reuse them in you [template](transformers.md#template) rules.
There are different kinds of templates:

- `raw` templates is named templates which may be imported or included by name into your field template, you can use macros to extend complex template.
- `files` templates is array of paths to files with template context.

```yaml
tables:
  - name: some_page
    rules:
      some_column:
        template:
          format: >
            {% import "base" as macros -%}
            {{ macros::decrement(n=10) }}
templates:
  raw:
    base: >
      {% macro decrement(n) -%}
      {% if n > 1 %}{{ n }}-{{ self::decrement(n=n-1) }}{% else %}1{% endif -%}
      {% endmacro decrement -%}"#;
  files:
    - ./templates/button.html
```

## globals

You can specify global variables available in all [template](transformers.md#template) rules.

```yaml
tables:
  - name: payment
    rules:
      amount:
      # using the value from globals
      template:
        format: "{{ prev.amount | float * payment_k }}"

  - name: staff
    rules:
      username:
        template:
          # using the value from globals
          format: "{{ global_value }}.{{ _1 }}"
            rules:
              # random number
              - random_num:
                  min: 100
                  max: 999

# global variables (they are available in templates)
globals:
  global_value: "gv123"
  payment_k: 1.73
```