# Creating a New Cartography Module

> **Related docs**: [Main AGENTS.md](../../../AGENTS.md) | [Add Node Type](add-node-type.md) | [Add Relationship](add-relationship.md) | [Analysis Jobs](analysis-jobs.md)

This guide walks you through creating a new Cartography intel module from scratch, covering the complete sync pattern, data model definitions, and testing.

## Table of Contents

1. [Module Structure](#module-structure) - File organization and entry points
2. [The Sync Pattern](#the-sync-pattern-get-transform-load-cleanup) - GET, TRANSFORM, LOAD, CLEANUP
3. [Data Model](#data-model-defining-nodes-and-relationships) - Nodes, properties, and relationships
4. [Configuration and Credentials](#configuration-and-credentials) - CLI args and validation
5. [Testing Your Module](#testing-your-module) - Integration tests and test data
6. [Schema Documentation](#schema-documentation) - Documenting your schema
7. [File Structure Template](#file-structure-template) - Directory layout at a glance
8. [Common Pitfalls](#common-pitfalls) - Troubleshooting common issues
9. [Coding Conventions](#coding-conventions) - Error handling, type hints, logging
10. [Final Checklist](#final-checklist) - Pre-submission checklist

## Module Structure

Every Cartography intel module follows this structure:

```
cartography/intel/your_module/
├── __init__.py          # Main entry point with sync orchestration
├── users.py             # Domain-specific sync modules (users, devices, etc.)
├── devices.py           # Additional domain modules as needed
└── ...

cartography/models/your_module/
├── user.py              # Data model definitions
├── tenant.py            # Tenant/account model
└── ...
```

### Main Entry Point (`__init__.py`)

```python
import logging
import neo4j
from cartography.config import Config
from cartography.util import timeit
import cartography.intel.your_module.users


logger = logging.getLogger(__name__)


@timeit
def start_your_module_ingestion(neo4j_session: neo4j.Session, config: Config) -> None:
    """
    Main entry point for your module ingestion
    """
    # Validate configuration
    if not config.your_module_api_key:
        logger.info("Your module import is not configured - skipping this module.")
        return

    # Set up common job parameters for cleanup
    common_job_parameters = {
        "UPDATE_TAG": config.update_tag,
        "TENANT_ID": config.your_module_tenant_id,  # if applicable
    }

    # Call domain-specific sync functions
    cartography.intel.your_module.users.sync(
        neo4j_session,
        config.your_module_api_key,
        config.your_module_tenant_id,
        config.update_tag,
        common_job_parameters,
    )
```

## The Sync Pattern: Get, Transform, Load, Cleanup

Every sync function follows this exact pattern:

```python
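from typing import Any

import neo4j

from cartography.client.core.tx import load
from cartography.config import Config
from cartography.util import timeit


@timeit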
def sync(
    neo4j_session: neo4j.Session,
    api_key: str,
    tenant_id: str,
    update_tag: int,
    common_job_parameters: dict[str, Any],
) -> None:
    """
    Main sync entry point for the module.
    """
    logger.info("Starting MyResource sync")

    # 1. GET - Fetch data from API
    logger.debug("Fetching MyResource data from API")
    raw_data = get(api_key, tenant_id)

    # 2. TRANSFORM - Shape data for ingestion
    logger.debug("Transforming %d MyResource items", len(raw_data))
    transformed_data = transform(raw_data)

    # 3. LOAD - Ingest to Neo4j using data model
    load_users(neo4j_session, transformed_data, tenant_id, update_tag)

    # 4. CLEANUP - Remove stale data
    logger.debug("Running MyResource cleanup job")
    cleanup(neo4j_session, common_job_parameters)

    logger.info("Completed MyResource sync")


def load_users(
    neo4j_session: neo4j.Session,
    data: list[dict[str, Any]],
    tenant_id: str,
    update_tag: int,
) -> None:
    load(
        neo4j_session,
        MyResourceSchema(),
        data,
        lastupdated=update_tag,
        TENANT_ID=tenant_id,
    )


def sync_for_parent(
    neo4j_session: neo4j.Session,
    parent_id: str,
    config: Config,
    common_job_parameters: dict[str, Any],
) -> None:
    """
    Sync resources for a specific parent (e.g., project, account, region).
    """
    logger.debug("Syncing MyResource for %s", parent_id)

    data = get_for_parent(parent_id, config)

    logger.debug("Transforming %d MyResource for %s", len(data), parent_id)
    transformed = transform(data)

    load_users(neo4j_session, transformed, parent_id, common_job_parameters["UPDATE_TAG"])
```

### GET: Fetching Data

The `get` function should be "dumb" - just fetch data and raise exceptions on failure:

```python
import requests


# AWS modules would also stack @aws_handle_regions under @timeit (see "Use Decorators" below).
@timeit
def get(api_key: str, tenant_id: str) -> dict[str, Any]:
    """
    Fetch data from external API
    Should be simple and raise exceptions on failure
    """
    payload = {
        "api_key": api_key,
        "tenant_id": tenant_id,
    }

    session = requests.Session()
    response = session.post(
        "https://api.yourservice.com/users",
        json=payload,
        timeout=(60, 60),  # (connect_timeout, read_timeout)
    )
    response.raise_for_status()  # Raise exception on HTTP error
    return response.json()
```

**Key Principles for `get()` Functions:**

1. **Minimal Error Handling**: Avoid adding try/except blocks in `get()` functions. Let errors propagate up to the caller.
   ```python
   # DON'T: Add complex error handling in get()
   def get_users(api_key: str) -> dict[str, Any]:
       try:
           response = requests.get(...)
           response.raise_for_status()
           return response.json()
       except requests.exceptions.HTTPError as e:
           if e.response.status_code == 401:
               logger.error("Invalid API key")
           elif e.response.status_code == 429:
               logger.error("Rate limit exceeded")
           raise
       except requests.exceptions.RequestException as e:
           logger.error(f"Network error: {e}")
           raise

   # DO: Keep it simple and let errors propagate
   def get_users(api_key: str) -> dict[str, Any]:
       response = requests.get(...)
       response.raise_for_status()
       return response.json()
   ```

2. **Use Decorators**: For AWS modules, use `@aws_handle_regions` to handle common AWS errors:
   ```python
   @timeit
   @aws_handle_regions  # Handles region availability, throttling, etc.
   def get_ec2_instances(boto3_session: boto3.session.Session, region: str) -> list[dict[str, Any]]:
       client = boto3_session.client("ec2", region_name=region)
       return client.describe_instances()["Reservations"]
   ```

3. **Fail Loudly**: If an error occurs, let it propagate up to the caller. This helps users identify and fix issues quickly:
   ```python
   # DON'T: Silently continue on error
   def get_data() -> dict[str, Any]:
       try:
           return api.get_data()
       except Exception:
           return {}  # Silently continue with empty data

   # DO: Let errors propagate
   def get_data() -> dict[str, Any]:
       return api.get_data()  # Let errors propagate to caller
   ```

4. **Timeout Configuration**: Set appropriate timeouts to avoid hanging:
   ```python
   # DO: Set timeouts
   response = session.post(
       "https://api.service.com/users",
       json=payload,
       timeout=(60, 60),  # (connect_timeout, read_timeout)
   )
   ```
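
To see why swallowing errors in `get()` is risky, the sketch below pairs a `get()` that hides failures with a tag-based cleanup. It is a simplified in-memory illustration, not Cartography code: `flaky_api`, `run_sync`, and the dict-based "graph" are hypothetical stand-ins.

```python
from typing import Any, Callable


def flaky_api() -> list[dict[str, Any]]:
    """Hypothetical API client that is currently failing."""
    raise ConnectionError("service unavailable")


def get_swallowing() -> list[dict[str, Any]]:
    # DON'T: hide the failure behind an empty result
    try:
        return flaky_api()
    except Exception:
        return []


def get_loud() -> list[dict[str, Any]]:
    # DO: let the failure propagate
    return flaky_api()


def run_sync(
    graph: dict[str, int],
    get: Callable[[], list[dict[str, Any]]],
    update_tag: int,
) -> None:
    """Toy get -> load -> cleanup cycle over a dict of node id -> lastupdated."""
    for record in get():
        graph[record["id"]] = update_tag
    # Cleanup: drop every node that was not touched in this run
    for node_id in [n for n, tag in graph.items() if tag != update_tag]:
        del graph[node_id]


graph_a = {"user-123": 1, "user-456": 1}
run_sync(graph_a, get_swallowing, update_tag=2)
# The outage looked like "no users", so cleanup removed every node.

graph_b = {"user-123": 1, "user-456": 1}
try:
    run_sync(graph_b, get_loud, update_tag=2)
except ConnectionError:
    pass  # In a real run this would surface to the operator
# The sync aborted before cleanup, so graph_b still holds both users.
```

In this toy model the silent variant turns a temporary outage into data loss, while the loud variant leaves the previous state intact.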

### TRANSFORM: Shaping Data

Transform data to make it easier to ingest. Handle required vs optional fields carefully:

```python
def transform(api_result: dict[str, Any]) -> list[dict[str, Any]]:
    """
    Transform API data for Neo4j ingestion
    """
    result: list[dict[str, Any]] = []

    for user_data in api_result["users"]:
        transformed_user = {
            # Required fields - use direct access (will raise KeyError if missing)
            "id": user_data["id"],
            "email": user_data["email"],

            # Optional fields - use .get() with None default
            "name": user_data.get("display_name"),
            "created_at": user_data.get("created_at"),
            "last_login": user_data.get("last_login"),
            "is_admin": user_data.get("is_admin"),
        }
        result.append(transformed_user)

    return result
```

**Key Principles:**
- **Required fields**: Use `data["field"]` - let it fail if missing
- **Optional fields**: Use `data.get("field")` - defaults to `None`
- **Consistency**: Use `None` for missing values, not empty strings
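
As a quick check of these rules, the following self-contained sketch runs a trimmed-down `transform()` against two hand-written payloads. The sample records are made up for illustration.

```python
from typing import Any


def transform(api_result: dict[str, Any]) -> list[dict[str, Any]]:
    return [
        {
            "id": user["id"],                      # required: KeyError if absent
            "email": user["email"],                # required: KeyError if absent
            "last_login": user.get("last_login"),  # optional: None if absent
        }
        for user in api_result["users"]
    ]


complete = {"users": [{"id": "u1", "email": "a@example.com"}]}
rows = transform(complete)
# rows[0]["last_login"] is None because the key was missing

broken = {"users": [{"id": "u2"}]}  # no "email"
try:
    transform(broken)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]
```

The missing optional field becomes `None`, while the missing required field stops the transform with a `KeyError` that names the absent key.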

## Data Model: Defining Nodes and Relationships

Modern Cartography uses a declarative data model. Here's how to define your schema:

### Node Properties

Define the properties that will be stored on your node:

```python
from dataclasses import dataclass
from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties

@dataclass(frozen=True)
class YourServiceUserNodeProperties(CartographyNodeProperties):
    # Required unique identifier
    id: PropertyRef = PropertyRef("id")

    # Automatic fields (set by cartography)
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)

    # Business fields from your API
    email: PropertyRef = PropertyRef("email", extra_index=True)  # Create index for queries
    name: PropertyRef = PropertyRef("name")
    created_at: PropertyRef = PropertyRef("created_at")
    last_login: PropertyRef = PropertyRef("last_login")
    is_admin: PropertyRef = PropertyRef("is_admin")

    # Fields from kwargs (same for all records in a batch)
    tenant_id: PropertyRef = PropertyRef("TENANT_ID", set_in_kwargs=True)
```

**PropertyRef Parameters:**
- First parameter: the key in each record dict, or the kwarg name when `set_in_kwargs=True`. Use record keys for values that differ per record; use kwargs for a value shared by every record in the batch.
- `extra_index=True`: Create database index for better query performance
- `set_in_kwargs=True`: Value comes from kwargs passed to `load()`, not from individual records

> For advanced node configurations (extra labels, ontology integration), see [Adding a New Node Type](add-node-type.md).
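
The difference between per-record keys and kwargs can be pictured with a small pure-Python model. This is a simplified illustration of the idea, not how Cartography resolves `PropertyRef` internally. `Ref` and `resolve` are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for PropertyRef: a source name plus a kwargs flag."""
    name: str
    set_in_kwargs: bool = False


def resolve(refs: dict[str, Ref], record: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
    """Build the property map for one record."""
    return {
        prop: kwargs[ref.name] if ref.set_in_kwargs else record.get(ref.name)
        for prop, ref in refs.items()
    }


refs = {
    "id": Ref("id"),
    "email": Ref("email"),
    "tenant_id": Ref("TENANT_ID", set_in_kwargs=True),
    "lastupdated": Ref("lastupdated", set_in_kwargs=True),
}
records = [
    {"id": "user-123", "email": "alice@example.com"},
    {"id": "user-456", "email": "bob@example.com"},
]
nodes = [resolve(refs, r, TENANT_ID="tenant-123", lastupdated=1700000000) for r in records]
```

Each node gets its own `id` and `email` from its record, while `tenant_id` and `lastupdated` are identical across the batch because they come from the kwargs.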

### Node Schema

Define your complete node schema:

```python
from cartography.models.core.nodes import CartographyNodeSchema
from cartography.models.core.relationships import OtherRelationships


@dataclass(frozen=True)
class YourServiceUserSchema(CartographyNodeSchema):
    label: str = "YourServiceUser"                              # Neo4j node label
    properties: YourServiceUserNodeProperties = YourServiceUserNodeProperties()
    sub_resource_relationship: YourServiceTenantToUserRel = YourServiceTenantToUserRel()

    # Optional: Additional relationships
    other_relationships: OtherRelationships = OtherRelationships([
        YourServiceUserToHumanRel(),  # Connect to Human nodes
    ])
```

### Sub-Resource Relationships: Always Point to Tenant-Like Objects

The `sub_resource_relationship` should **always** refer to a tenant-like object that represents the ownership or organizational boundary of the resource. This is crucial for proper data organization and cleanup operations.

**Correct Examples:**
- **AWS Resources**: Point to `AWSAccount` (tenant = AWS account)
- **Azure Resources**: Point to `AzureSubscription` (tenant = Azure subscription)
- **GCP Resources**: Point to `GCPProject` (tenant = GCP project)
- **SaaS Applications**: Point to `YourServiceTenant` (tenant = organization/company)
- **GitHub Resources**: Point to `GitHubOrganization` (tenant = GitHub org)

**Incorrect Examples:**
- Pointing to a parent resource that's not tenant-like (e.g., `ECSTaskDefinition` -> `ECSTask`)
- Pointing to infrastructure components (e.g., `ECSContainer` -> `ECSTask`)
- Pointing to logical groupings that aren't organizational boundaries

**Example: AWS ECS Container Definitions**

```python
# CORRECT: Container definitions belong to AWS accounts
@dataclass(frozen=True)
class ECSContainerDefinitionSchema(CartographyNodeSchema):
    label: str = "ECSContainerDefinition"
    properties: ECSContainerDefinitionNodeProperties = ECSContainerDefinitionNodeProperties()
    sub_resource_relationship: ECSContainerDefinitionToAWSAccountRel = ECSContainerDefinitionToAWSAccountRel()
    other_relationships: OtherRelationships = OtherRelationships([
        ECSContainerDefinitionToTaskDefinitionRel(),  # Business relationship
    ])

# CORRECT: Relationship to AWS Account (tenant-like)
@dataclass(frozen=True)
class ECSContainerDefinitionToAWSAccountRel(CartographyRelSchema):
    target_node_label: str = "AWSAccount"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
        "id": PropertyRef("AWS_ID", set_in_kwargs=True),
    })
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "RESOURCE"
    properties: ECSContainerDefinitionToAWSAccountRelProperties = ECSContainerDefinitionToAWSAccountRelProperties()

# CORRECT: Business relationship to task definition (not tenant-like)
@dataclass(frozen=True)
class ECSContainerDefinitionToTaskDefinitionRel(CartographyRelSchema):
    target_node_label: str = "ECSTaskDefinition"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
        "id": PropertyRef("_taskDefinitionArn"),
    })
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "HAS_CONTAINER_DEFINITION"
    properties: ECSContainerDefinitionToTaskDefinitionRelProperties = ECSContainerDefinitionToTaskDefinitionRelProperties()
```

**Why This Matters:**
1. **Cleanup Operations**: Cartography uses the sub-resource relationship to determine which data to clean up during sync operations
2. **Data Organization**: Tenant-like objects provide natural boundaries for data organization
3. **Access Control**: Tenant relationships enable proper access control and data isolation
4. **Consistency**: Following this pattern ensures consistent data modeling across all modules
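
The cleanup point can be made concrete with a toy model. The sketch below is an assumption-laden simplification: real cleanup runs as graph queries derived from the schema, whereas here nodes are plain dicts and `scoped_cleanup` is a hypothetical helper.

```python
from typing import Any


def scoped_cleanup(nodes: list[dict[str, Any]], params: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep a node unless it belongs to the tenant being synced and is stale."""
    return [
        n for n in nodes
        if not (n["tenant"] == params["TENANT_ID"] and n["lastupdated"] != params["UPDATE_TAG"])
    ]


nodes = [
    {"id": "a-1", "tenant": "tenant-a", "lastupdated": 2},  # refreshed this run
    {"id": "a-2", "tenant": "tenant-a", "lastupdated": 1},  # stale, same tenant
    {"id": "b-1", "tenant": "tenant-b", "lastupdated": 1},  # other tenant, not part of this run
]
remaining = scoped_cleanup(nodes, {"UPDATE_TAG": 2, "TENANT_ID": "tenant-a"})
remaining_ids = [n["id"] for n in remaining]
```

Only the stale node inside the synced tenant is removed. Without a tenant-like anchor there would be no way to tell `a-2` (stale) apart from `b-1` (simply not synced in this run).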

### Relationships

Define how your nodes connect to other nodes:

```python
from cartography.models.core.relationships import (
    CartographyRelSchema, CartographyRelProperties, LinkDirection,
    make_target_node_matcher, TargetNodeMatcher
)

# Relationship properties (usually just lastupdated)
@dataclass(frozen=True)
class YourServiceTenantToUserRelProperties(CartographyRelProperties):
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)

# The relationship itself
@dataclass(frozen=True)
class YourServiceTenantToUserRel(CartographyRelSchema):
    target_node_label: str = "YourServiceTenant"                # What we're connecting to
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
        "id": PropertyRef("TENANT_ID", set_in_kwargs=True),     # Match on tenant.id = TENANT_ID kwarg
    })
    direction: LinkDirection = LinkDirection.INWARD             # Tenant points to User
    rel_label: str = "RESOURCE"                                 # Relationship label
    properties: YourServiceTenantToUserRelProperties = YourServiceTenantToUserRelProperties()
```

**Relationship Directions:**
- `LinkDirection.INWARD`: `(:YourServiceTenant)-[:RESOURCE]->(:YourServiceUser)` - Used for sub_resource relationships
- `LinkDirection.OUTWARD`: `(:YourServiceUser)-[:RESOURCE]->(:YourServiceTenant)` - Rarely used for RESOURCE

> For advanced relationship patterns (MatchLinks, one-to-many, cross-module relationships), see [Adding a New Relationship](add-relationship.md).
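
To make the two directions easier to remember, here is a minimal sketch that renders the resulting pattern as a string. `Direction` and `render_pattern` are hypothetical stand-ins written for this example and are not part of Cartography.

```python
from enum import Enum


class Direction(Enum):
    """Hypothetical stand-in for LinkDirection."""
    INWARD = "inward"
    OUTWARD = "outward"


def render_pattern(node_label: str, target_label: str, rel_label: str, direction: Direction) -> str:
    """Return the Cypher-style pattern for a relationship defined on `node_label`."""
    if direction is Direction.INWARD:
        # The arrow points into the node that owns the schema
        return f"(:{target_label})-[:{rel_label}]->(:{node_label})"
    # The arrow points away from the node that owns the schema
    return f"(:{node_label})-[:{rel_label}]->(:{target_label})"


inward = render_pattern("YourServiceUser", "YourServiceTenant", "RESOURCE", Direction.INWARD)
outward = render_pattern("YourServiceUser", "YourServiceTenant", "RESOURCE", Direction.OUTWARD)
```

The direction is always read from the point of view of the node whose schema declares the relationship, which is why the tenant-to-user `RESOURCE` edge is `INWARD` on the user schema.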

### Loading Data

Use the `load` function with your schema:

```python
from cartography.client.core.tx import load


def load_users(
    neo4j_session: neo4j.Session,
    data: list[dict[str, Any]],
    tenant_id: str,
    update_tag: int,
) -> None:
    # Load tenant first (if it doesn't exist)
    load(
        neo4j_session,
        YourServiceTenantSchema(),
        [{"id": tenant_id}],
        lastupdated=update_tag,
    )

    # Load users with relationships
    load(
        neo4j_session,
        YourServiceUserSchema(),
        data,
        lastupdated=update_tag,
        TENANT_ID=tenant_id,  # This becomes available as PropertyRef("TENANT_ID", set_in_kwargs=True)
    )
```

### Cleanup Jobs

Always implement cleanup to remove stale data:

```python
from cartography.graph.job import GraphJob

def cleanup(neo4j_session: neo4j.Session, common_job_parameters: dict[str, Any]) -> None:
    """
    Remove nodes that weren't updated in this sync run
    """
    logger.debug("Running Your Service cleanup job")

    # Cleanup users
    GraphJob.from_node_schema(YourServiceUserSchema(), common_job_parameters).run(neo4j_session)
```

### Analysis Jobs (Optional)

For modules that require post-ingestion graph enrichment (e.g., internet exposure analysis, permission inheritance), add analysis job calls at the end of your main ingestion function. See [Adding Analysis Jobs](analysis-jobs.md) for detailed patterns and examples.

```python
from cartography.util import run_analysis_job

@timeit
def start_your_module_ingestion(neo4j_session: neo4j.Session, config: Config) -> None:
    # ... sync all resources ...

    # Optional: Run analysis jobs after all data is synced
    run_analysis_job(
        "your_module_analysis.json",
        neo4j_session,
        common_job_parameters,
    )
```

## Configuration and Credentials

### Adding CLI Arguments

Add your configuration options in `cartography/cli.py`. The CLI uses [Typer](https://typer.tiangolo.com/) with options organized into help panels.

1. **Add a panel constant** at the top of the file:

```python
PANEL_YOUR_SERVICE = "Your Service Options"
```

2. **Add the panel to MODULE_PANELS** mapping:

```python
MODULE_PANELS = {
    # ... existing modules ...
    "yourservice": PANEL_YOUR_SERVICE,
}
```

3. **Add options** in the `run()` function inside `_build_app()`:

```python
# =================================================================
# Your Service Options
# =================================================================
your_service_api_key_env_var: Annotated[
    Optional[str],
    typer.Option(
        "--your-service-api-key-env-var",
        help="Environment variable name containing Your Service API key.",
        rich_help_panel=PANEL_YOUR_SERVICE,
        hidden=PANEL_YOUR_SERVICE not in visible_panels,
    ),
] = None,
your_service_tenant_id: Annotated[
    Optional[str],
    typer.Option(
        "--your-service-tenant-id",
        help="Your Service tenant ID.",
        rich_help_panel=PANEL_YOUR_SERVICE,
        hidden=PANEL_YOUR_SERVICE not in visible_panels,
    ),
] = None,
```

4. **Read secrets from environment** and pass to Config (in the `run()` function body):

```python
# Read Your Service API key
your_service_api_key = None
if your_service_api_key_env_var:
    your_service_api_key = os.environ.get(your_service_api_key_env_var)
```

5. **Add to Config constructor call**:

```python
config = cartography.config.Config(
    # ... existing fields ...
    your_service_api_key=your_service_api_key,
    your_service_tenant_id=your_service_tenant_id,
)
```

### Configuration Object

Add fields to `cartography/config.py`:

```python
class Config:
    def __init__(
        self,
        # ... existing fields ...
        your_service_api_key=None,
        your_service_tenant_id=None,
    ):
        # ... existing assignments ...
        self.your_service_api_key = your_service_api_key
        self.your_service_tenant_id = your_service_tenant_id
```

### Validation in Module

Always validate your configuration:

```python
def start_your_service_ingestion(neo4j_session: neo4j.Session, config: Config) -> None:
    # Validate required configuration
    if not config.your_service_api_key:
        logger.info("Your Service API key not configured - skipping module")
        return

    if not config.your_service_tenant_id:
        logger.info("Your Service tenant ID not configured - skipping module")
        return

    # Use the API key from config (already resolved from environment by CLI)
    api_key = config.your_service_api_key
```

## Testing Your Module

**Key Principle: Test outcomes, not implementation details.**

Focus on verifying that data is written to the graph as expected, rather than testing internal function parameters or implementation details. Mock external dependencies (APIs, databases) when necessary, but avoid brittle parameter testing.

### Test Data

Create mock data in `tests/data/your_service/`:

```python
# tests/data/your_service/users.py
MOCK_USERS_RESPONSE = {
    "users": [
        {
            "id": "user-123",
            "email": "alice@example.com",
            "display_name": "Alice Smith",
            "created_at": "2023-01-15T10:30:00Z",
            "last_login": "2023-12-01T14:22:00Z",
            "is_admin": False,
        },
        {
            "id": "user-456",
            "email": "bob@example.com",
            "display_name": "Bob Jones",
            "created_at": "2023-02-20T16:45:00Z",
            "last_login": None,  # Never logged in
            "is_admin": True,
        },
    ]
}
```

### Integration Tests

Test actual Neo4j loading in `tests/integration/cartography/intel/your_service/`:

```python
# tests/integration/cartography/intel/your_service/test_users.py
from unittest.mock import patch
import cartography.intel.your_service.users
from tests.data.your_service.users import MOCK_USERS_RESPONSE
from tests.integration.util import check_nodes
from tests.integration.util import check_rels


TEST_UPDATE_TAG = 123456789
TEST_TENANT_ID = "tenant-123"

@patch.object(
    cartography.intel.your_service.users,
    "get",
    return_value=MOCK_USERS_RESPONSE,
)
def test_sync_users(mock_api, neo4j_session):
    """
    Test that users sync correctly and create proper nodes and relationships
    """
    # Act - Use the sync function instead of calling load directly
    cartography.intel.your_service.users.sync(
        neo4j_session,
        "fake-api-key",
        TEST_TENANT_ID,
        TEST_UPDATE_TAG,
        {"UPDATE_TAG": TEST_UPDATE_TAG, "TENANT_ID": TEST_TENANT_ID},
    )

    # DO: Test outcomes - verify data is written to the graph as expected
    # Assert - Use check_nodes() instead of raw Neo4j queries.
    expected_nodes = {
        ("user-123", "alice@example.com"),
        ("user-456", "bob@example.com"),
    }
    assert check_nodes(neo4j_session, "YourServiceUser", ["id", "email"]) == expected_nodes

    # Verify tenant was created
    expected_tenant_nodes = {
        (TEST_TENANT_ID,),
    }
    assert check_nodes(neo4j_session, "YourServiceTenant", ["id"]) == expected_tenant_nodes

    # Assert relationships are created correctly.
    # Use check_rels() instead of raw Neo4j queries for relationships
    expected_rels = {
        ("user-123", TEST_TENANT_ID),
        ("user-456", TEST_TENANT_ID),
    }
    assert (
        check_rels(
            neo4j_session,
            "YourServiceUser",
            "id",
            "YourServiceTenant",
            "id",
            "RESOURCE",
            rel_direction_right=False,  # (:YourServiceUser)<-[:RESOURCE]-(:YourServiceTenant)
        )
        == expected_rels
    )
```

**What to Test:**
- **Outcomes**: Nodes created with correct properties
- **Outcomes**: Relationships created between expected nodes

**What NOT to Test:**
- **Implementation**: Function parameters passed to mocks (brittle!)
- **Implementation**: Internal function call order
- **Implementation**: Mock call counts unless absolutely necessary

**When to Mock:**
- External APIs (AWS, Azure, third-party services) - provide test data
- Database connections - avoid real connections
- Network calls - avoid real network requests

**When NOT to Mock:**
- Internal Cartography functions
- Data transformation logic
- The function that is being tested

## Schema Documentation

Always document your schema in `docs/root/modules/your_service/schema.md`. Follow these formatting conventions:

### Documentation Conventions

1. **Title Levels**:
   - Use `###` (h3) for node names
   - Use `####` (h4) for the "Relationships" subsection

2. **Indexed Fields in Bold**:
   - Mark indexed fields (primary key, extra_index=True) with **bold** in the table
   - Example: `|**id**| The unique identifier|`

3. **Ontology Mapping Note** (if applicable):
   - Add a blockquote after the node description for nodes with semantic labels
   - See [Enriching the Ontology](enrich-ontology.md#documenting-ontology-integration) for the standard phrase format

### Example Documentation

```markdown
## Your Service Schema

### YourServiceUser

Represents a user in Your Service.

> **Ontology Mapping**: This node has the extra label `UserAccount` to enable cross-platform queries for user accounts across different systems (e.g., OktaUser, EntraUser, GSuiteUser).

| Field | Description |
|-------|-------------|
| firstseen | Timestamp of when a sync job first discovered this node |
| lastupdated | Timestamp of the last time the node was updated |
| **id** | Unique user identifier |
| **email** | User email address (indexed for queries) |
| name | User display name |
| created_at | Account creation timestamp |
| last_login | Last login timestamp |
| is_admin | Admin privileges flag |

#### Relationships

- YourServiceUser belongs to YourServiceTenant.
    ```cypher
    (:YourServiceTenant)-[:RESOURCE]->(:YourServiceUser)
    ```

- YourServiceUser may be connected to Human nodes.
    ```cypher
    (:Human)-[:IDENTITY_YOUR_SERVICE]->(:YourServiceUser)
    ```
```

## File Structure Template

```
cartography/intel/your_service/
├── __init__.py          # Main entry point
└── entities.py          # Domain sync modules

cartography/models/your_service/
├── entity.py            # Data model definitions
└── tenant.py            # Tenant model

tests/data/your_service/
└── entities.py          # Mock test data

tests/unit/cartography/intel/your_service/
└── test_entities.py     # Unit tests

tests/integration/cartography/intel/your_service/
└── test_entities.py     # Integration tests
```

## Common Pitfalls

### Import Errors

```python
# Problem: ModuleNotFoundError for your new module
# Solution: Ensure __init__.py files exist in all directories:
#   cartography/intel/your_service/__init__.py
#   cartography/models/your_service/__init__.py
```

### Schema Validation Errors

```python
# Problem: "PropertyRef validation failed"
# Solution: Check dataclass syntax and PropertyRef definitions
@dataclass(frozen=True)  # Don't forget frozen=True!
class YourNodeProperties(CartographyNodeProperties):
    id: PropertyRef = PropertyRef("id")  # Must have type annotation
```

### Relationships Not Created

```python
# Problem: Relationships not created
# Solution: Ensure target nodes exist before creating relationships
# Load parent nodes first:
load(neo4j_session, TenantSchema(), tenant_data, lastupdated=update_tag)
# Then load child nodes with relationships:
load(neo4j_session, UserSchema(), user_data, lastupdated=update_tag, TENANT_ID=tenant_id)
```

### Cleanup Job Failures

```python
# Problem: "GraphJob failed" during cleanup
# Solution: Check common_job_parameters structure
common_job_parameters = {
    "UPDATE_TAG": config.update_tag,  # Must match what's set on nodes
    "TENANT_ID": tenant_id,           # If using scoped cleanup (default)
}
```

### Date Handling

Neo4j 4+ supports native Python datetime objects and ISO 8601 strings:

```python
# DON'T: Manually parse dates or convert to epoch timestamps
"created_at": int(dt_parse.parse(user_data["created_at"]).timestamp() * 1000)

# DO: Pass datetime values directly - Neo4j handles them natively
"created_at": user_data.get("created_at")
"last_login": user_data.get("last_login")
```

### Performance Issues

```python
# Problem: Slow queries
# Solution: Add indexes to frequently queried fields
email: PropertyRef = PropertyRef("email", extra_index=True)

# Note: Fields in target_node_matcher are indexed automatically
```

## Coding Conventions

### Error Handling Principles

#### Fail Loudly When Assumptions Break

Cartography likes to fail loudly so that broken assumptions bubble exceptions up to operators instead of being papered over.

- When key assumptions your code relies upon stop being true, **stop execution immediately** and let the error propagate.
- Lean toward propagating errors up to callers instead of logging a warning inside a `try`/`except` block and continuing.
- If you're confident data should always exist, access it directly. Allow natural `KeyError`, `AttributeError`, or `IndexError` exceptions to signal corruption.
- Never manufacture "safe" default return values for required data.
- Avoid `hasattr()`/`getattr()` for required fields - rely on schemas and tests to detect breakage.

```python
# DON'T: Catch base exceptions and continue silently
try:
    risky_operation()
except Exception:
    logger.error("Something went wrong")
    pass  # Silently continue - BAD!

# DO: Let errors propagate or handle specifically
result = risky_operation()  # Let it fail if something is wrong
```

#### Required vs Optional Field Access

```python
def transform_user(user_data: dict[str, Any]) -> dict[str, Any]:
    return {
        # Required field - let it raise KeyError if missing
        "id": user_data["id"],
        "email": user_data["email"],

        # Optional field - gracefully handle missing data
        "name": user_data.get("display_name"),
        "phone": user_data.get("phone_number"),
    }
```

### Type Hints Style Guide

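Use modern built-in type hints. Built-in generics such as `dict[str, Any]` require Python 3.9+, and the `X | None` union syntax requires Python 3.10+:

```python
# DO: Use built-in type hints (Python 3.9+)
def get_users(api_key: str) -> dict[str, Any]:
    ...

# DO: Use union operator for optional types (Python 3.10+)
def process_user(user_id: str | None) -> None:
    ...

# DON'T: Use the legacy typing aliases (Dict, List, Optional); Any is still imported from typing
```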

### Logging Guidelines

#### Log Levels

Use appropriate log levels to reduce noise in production:

| Level | Usage |
|-------|-------|
| `CRITICAL` | Framework-level component failures that cause cascading errors |
| `ERROR` | Explicit errors raised at the module level |
| `WARNING` | Transient errors or configuration issues that do not stop the module |
| `INFO` | High-level milestones (module start/finish) and significant summary statistics |
| `DEBUG` | Everything else: granular job details, empty result sets, raw data |

**Key Principle**: `INFO` should be reserved for actionable, high-level events. Empty states like "Loaded 0 results" or routine operations like "Graph job executed" belong in `DEBUG`.

```python
# DO: Use INFO for significant milestones
logger.info("Starting %s ingestion for tenant %s", module_name, tenant_id)
logger.info("Completed %s sync", module_name)

# DO: Use DEBUG for granular details
logger.debug("Running cleanup job for %s", schema_name)
logger.debug("Fetched %s results from API", len(results))
logger.debug("Transforming %s items", len(data))

# DON'T: Use INFO for routine operations
logger.info("Graph job executed")  # Should be DEBUG
logger.info("Fetched 0 users")     # Should be DEBUG
```

> **Note**: Do not log the number of nodes or relationships loaded. This is handled automatically by the `load()` function in `cartography/client/core/tx.py`.

#### Logging Format

Use lazy evaluation with `%s` formatting instead of f-strings. This avoids string interpolation when the log level is not active:

```python
# DO: Use % formatting (lazy evaluation)
logger.info("Processing %s users for tenant %s", count, tenant_id)
logger.debug("API response: %s", response_data)
logger.warning("Rate limited, retrying in %s seconds", retry_delay)

# DON'T: Use f-strings (eager evaluation)
logger.info(f"Processing {count} users for tenant {tenant_id}")
logger.debug(f"API response: {response_data}")
```
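
The cost difference can be observed directly with the standard `logging` module. In this small sketch, `Expensive` is a hypothetical helper class that counts how often it is rendered to a string.

```python
import logging


class Expensive:
    """Counts how many times it is rendered to a string."""

    def __init__(self) -> None:
        self.renders = 0

    def __str__(self) -> str:
        self.renders += 1
        return "large payload"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.INFO)  # DEBUG records are filtered out
logger.addHandler(logging.NullHandler())
logger.propagate = False

lazy = Expensive()
logger.debug("API response: %s", lazy)  # Argument is never formatted
lazy_renders = lazy.renders

eager = Expensive()
logger.debug(f"API response: {eager}")  # f-string is built before debug() runs
eager_renders = eager.renders
```

With the logger at `INFO`, the `%s` call never converts its argument, while the f-string pays the formatting cost even though the record is discarded.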

## Final Checklist

Before submitting your module:

- [ ] **Configuration**: CLI args, config validation, credential handling
- [ ] **Sync Pattern**: get() -> transform() -> load() -> cleanup()
- [ ] **Data Model**: Node properties, relationships, proper typing
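- [ ] **Schema Fields**: Only define the fields declared by the base classes in `CartographyNodeSchema`/`CartographyRelSchema` subclasses (e.g., `label`, `properties`, `sub_resource_relationship`, `other_relationships` for nodes; `target_node_label`, `target_node_matcher`, `direction`, `rel_label`, `properties` for relationships)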
- [ ] **Scoped Cleanup**: Verify `scoped_cleanup=True` (default) for tenant-scoped resources
- [ ] **Error Handling**: Specific exceptions, required vs optional fields
- [ ] **Testing**: Integration tests for sync functions
- [ ] **Documentation**: Schema docs, docstrings, inline comments
- [ ] **Cleanup**: Proper cleanup job implementation
- [ ] **Indexing**: Extra indexes on frequently queried fields
- [ ] **Analysis Jobs** (optional): If your module needs post-ingestion enrichment, see [Analysis Jobs](analysis-jobs.md)