<!--
DOCUMENT METADATA
=================
Title: OpenSpec Agent Instructions
Type: Guide
Category: AI Configuration
Status: Active
Version: 1.0
Created: 2024-10-01
Last Updated: 2025-10-29
Author: Project Team

Related Documents:
- CLAUDE.md (Claude assistant configuration)
- README.md
- All change specifications (docs/00*.md)

Related Code:
- openspec/ (OpenSpec framework)

Dependencies:
- OpenSpec framework

Keywords: openspec, agents, ai-workflow, change-management, specifications, proposals, automation

Summary:
Detailed instructions for AI agents working with OpenSpec framework on this project. Defines workflows for creating proposals, implementing changes, and archiving completed work. Includes file conventions, naming patterns, status tracking, and integration with project documentation structure.

Audience:
AI assistants (Claude), developers using AI-assisted workflows, project maintainers.
-->

<!-- OPENSPEC:START -->
# OpenSpec Instructions

These instructions are for AI assistants working in this project.

Always open `@/openspec/AGENTS.md` when the request:
- Mentions planning or proposals (words like proposal, spec, change, plan)
- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work
- Sounds ambiguous and you need the authoritative spec before coding

Use `@/openspec/AGENTS.md` to learn:
- How to create and apply change proposals
- Spec format and conventions
- Project structure and guidelines

Keep this managed block so 'openspec update' can refresh the instructions.

<!-- OPENSPEC:END -->

# AGENTS.md - OEVK Data Processing Project

## Build/Lint/Test Commands
- **Test command**: `python -m pytest tests/`
- **Single test**: `python -m pytest tests/test_file.py::test_function -v`
- **Lint**: `ruff check .`
- **Type check**: `mypy .`
- **Release workflow**: `python -m src.cli release create --auto`
- **Deduplication tests**: `python -m pytest tests/contract/test_deduplication.py tests/integration/test_deduplication_*.py tests/unit/test_deduplication_*.py -v`

## Release Workflow Commands

### Data Validation
- **Validate release data**: `python -m src.cli release validate --staging-dir data/staging --exports-dir exports`
- **Validate with custom directories**: `python -m src.cli release validate --staging-dir /path/to/staging --exports-dir /path/to/exports`
- **Verbose validation with debug logging**: `python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports`

### Release Creation
- **Create release with auto-generated tag**: `python -m src.cli release create --repo-owner owner --repo-name repo --auto`
- **Create release with specific tag**: `python -m src.cli release create --repo-owner owner --repo-name repo --tag 20250101-1200`
- **Create draft release**: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --draft`
- **Create prerelease**: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --prerelease`
- **Force overwrite existing release**: `python -m src.cli release create --repo-owner owner --repo-name repo --tag existing-tag --force`
- **Create packages without upload**: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --skip-upload`

### Release Management
- **Check release status**: `python -m src.cli release status --repo-owner owner --repo-name repo --tag 20250101-1200`
- **List recent releases**: `python -m src.cli release history --repo-owner owner --repo-name repo --limit 10`

## Deduplication Commands

### Basic Deduplication
- **Run deduplication with report**: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); result = dedup.deduplicate_addresses(addresses_df)"`
- **Run deduplication without report**: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); result = dedup.deduplicate_addresses(addresses_df, generate_report=False)"`
- **Process large datasets**: `python -c "from src.etl.deduplicate import deduplicate_large_dataset; result = deduplicate_large_dataset(addresses_df, chunk_size=100000)"`

### Report Generation
- **Generate deduplication report**: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); report = dedup.generate_deduplication_report(addresses_df, result, processing_time_ms)"`
- **Export report to JSON**: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); json_report = dedup.export_report_to_json(report)"`

### Testing
- **Run deduplication tests**: `python -m pytest tests/contract/test_deduplication.py -v`
- **Run integration tests**: `python -m pytest tests/integration/test_deduplication_*.py -v`
- **Run unit tests**: `python -m pytest tests/unit/test_deduplication_*.py -v`
- **Run all deduplication tests**: `python -m pytest tests/contract/test_deduplication.py tests/integration/test_deduplication_*.py tests/unit/test_deduplication_*.py -v`

### Environment Variables
- **GitHub Token**: `GITHUB_TOKEN` (required for release operations)
- **Default directories**: `STAGING_DIR=data/staging`, `EXPORTS_DIR=exports`

## Complete Data Processing Pipeline

The OEVK data processing pipeline includes the following stages:

### Pipeline Stages
1. **Ingestion**: Downloads and loads source data from valasztas.hu
2. **Transformation**: Normalizes data into 8 core tables with parallel processing
3. **Public Space Extraction**: Extracts public space entities (names and types) from addresses
4. **Export**: Creates CSV files for all tables including public space data
5. **Release**: Packages and publishes data to GitHub releases

### Public Space Extraction Features
- **Entity Recognition**: Extracts public space names and types from addresses
- **Relationship Mapping**: Creates settlement-public space relationships
- **Hash-based IDs**: Deterministic xxhash64 identifiers for all entities
- **Data Integrity**: Full validation and referential integrity
- **Export Support**: CSV export for all public space entities

### Address Deduplication Features
- **Duplicate Identification**: Identifies duplicate addresses using deterministic hash IDs
- **Canonical Address Creation**: Creates unique canonical address records
- **Relationship Preservation**: Maintains all original relationships (polling stations, PIR codes)
- **Report Generation**: Generates comprehensive deduplication reports with statistics
- **JSON Export**: Exports deduplication reports to JSON format for external analysis
- **Large Dataset Support**: Processes large datasets in chunks for optimal performance

### Public Space Tables
1. **PublicSpaceName**: Unique public space names extracted from addresses
2. **PublicSpaceType**: Unique public space types (utca, tér, etc.)
3. **SettlementPublicSpaces**: Many-to-many relationships between settlements and public spaces

### Deduplication Tables
1. **CanonicalAddresses**: Unique canonical address records after deduplication
2. **AddressMapping**: Mapping between original addresses and canonical addresses
3. **AddressPollingStations**: Preserved polling station assignments
4. **AddressPIRCodes**: Preserved PIR code relationships
5. **DeduplicationReport**: Audit reports with deduplication statistics

## Code Style Guidelines

### Language & Framework
- **Primary language**: Python 3.11+
- **Data processing**: Polars or DuckDB for large datasets (>3M rows)
- **Database**: SQLite/DuckDB for staging/target (single-file, zero admin)

### Imports & Structure
- Use absolute imports
- Group imports: standard library, third-party, local modules
- Follow PEP 8 for Python code
- Use type hints for all functions and variables

### Naming Conventions
- **Variables**: snake_case (`county_code`, `settlement_name`)
- **Functions**: snake_case (`load_addresses_csv`, `transform_data`)
- **Classes**: PascalCase (`CountyLoader`, `AddressTransformer`)
- **Constants**: UPPER_SNAKE_CASE (`DATA_DIR`, `CHUNK_SIZE`)

### Data Processing Patterns
- Use deterministic hash IDs for all entities (xxhash64 recommended)
- Trim whitespace, convert empty strings to `NULL`
- Preserve diacritics and original casing for Hungarian text
- Use vectorized operations, avoid Python row loops
- Process data in chunks (100k-500k rows)

### Error Handling
- Use structured logging with levels (INFO/DEBUG)
- Validate data quality with referential integrity checks
- Stage invalid rows with error codes and messages
- Make operations idempotent and restartable

### File Organization
- Keep SQL DDL in separate files
- Modular scripts: ingest.py, transform.py, export.py, release/
- Separate configuration from business logic
- Use environment variables for configuration

### Release Workflow Architecture

#### Core Modules
- **Workflow Orchestrator**: `src/release/workflow.py` - Coordinates complete release process
- **Data Validation**: `src/release/validation.py` - Pre-release integrity and quality checks
- **File Packaging**: `src/release/packaging.py` - Creates compressed ZIP archives
- **GitHub Integration**: `src/release/github.py` - GitHub CLI integration for releases
- **Data Models**: `src/release/models.py` - ReleasePackage, ReleaseArtifact, ReleaseMetadata

#### Release Artifacts
- **CSV Archive**: `oevk-data-csv-{tag}.zip` - Contains all CSV files:
  - `addresses/` - Directory containing split address files by settlement (e.g., `Address_001_Budapest.csv`, `Address_002_Debrecen.csv`)
  - `settlements.csv` - Settlement reference data
  - `counties.csv` - County reference data
  - `NationalIndividualElectoralDistrict.csv` - National electoral districts
  - `PollingStation.csv` - Polling station locations
  - `PostalCode.csv` - Postal code data
  - `PostalCode_Settlement.csv` - Postal code to settlement mapping
  - `SettlementIndividualElectoralDistrict.csv` - Settlement to electoral district mapping
  - `PublicSpaceName.csv` - Unique public space names extracted from addresses
  - `PublicSpaceType.csv` - Unique public space types (utca, tér, etc.)
  - `SettlementPublicSpaces.csv` - Many-to-many relationships between settlements and public spaces
- **Database Archive**: `oevk-data-db-{tag}.zip` - Contains oevk.db (main transformed database)
- **Release Metadata**: JSON metadata with validation results and performance metrics

#### Release Tags
- **Format**: YYYYMMDD-HHMM (timestamp-based to prevent duplicates)
- **Auto-generation**: Uses current timestamp when not specified
- **Validation**: Ensures unique tags to prevent conflicts

#### Data Validation
- **File Existence**: Verifies all required files exist
- **File Sizes**: Ensures files have reasonable sizes
- **File Integrity**: Validates files are readable and not corrupted
- **Data Completeness**: Checks for required headers and data
- **Referential Integrity**: Validates relationships between entities
- **Data Freshness**: Ensures data is recent (≤24 hours old)

#### Performance Targets
- **Complete Workflow**: ≤15 minutes for full release process
- **Data Validation**: ≤2 minutes for comprehensive checks
- **Package Creation**: ≤5 minutes for artifact compression
- **GitHub Integration**: ≤3 minutes for release creation
- **CSV Export**: ~2.6 minutes for 3.3M addresses (single-query optimization)
- **Idempotent Operations**: Safe to retry failed operations

### CSV Export Optimization

The project uses an optimized single-query approach for exporting canonical addresses:

#### Export Performance
- **Approach**: Single database query + Python partitioning
- **Performance**: ~2.6 minutes for 3.3M addresses across 3,177 settlements
- **Speed**: ~21,000 addresses/second
- **Improvement**: ~17x faster than per-settlement query approach
- **Query Time**: ~24 seconds to fetch all addresses with JOINs
- **Write Time**: ~2.2 minutes to write 3,177 CSV files

#### Export Features
- **Automatic Cleanup**: Removes old export files before new export
- **Incremental Progress**: Logs progress every 500 settlements
- **Memory Efficient**: Streaming write approach
- **Symlink Management**: Creates symlinks for release validation
- **Default Directory**: `exports/` (aligned with release workflow)

#### Export Commands
```bash
# Basic export to default directory (exports/)
python -m src.cli export --db-path data/oevk.db

# Export with custom output directory
python -m src.cli export --db-path data/oevk.db --output-dir /path/to/output

# Export with custom run tag
python -m src.cli export --run-tag "v1.0.0"

# Export only entity tables (skip addresses)
python -m src.cli export --tables-only

# Export only addresses (skip entity tables)
python -m src.cli export --addresses-only
```

#### Export Structure
```
exports/
├── {run_tag}_Address/              # 3,177 settlement CSV files
│   ├── Address_001_Aba.csv
│   ├── Address_002_Abádszalók.csv
│   └── ...
├── {run_tag}_County.csv            # Entity tables
├── {run_tag}_Settlement.csv
├── ... (8 more entity tables)
├── addresses -> {run_tag}_Address  # Symlinks for release
├── settlements.csv -> {run_tag}_Settlement.csv
├── counties.csv -> {run_tag}_County.csv
└── database.duckdb -> ../data/oevk.db
```

## Release Workflow Usage Examples

### Basic Release Creation
```bash
# Set GitHub token (required)
export GITHUB_TOKEN="ghp_your_token_here"

# Create release with auto-generated tag
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto

# Create release with specific tag
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
```

### Advanced Release Scenarios
```bash
# Create draft release for review
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --draft

# Create prerelease (beta/alpha)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --prerelease

# Force overwrite existing release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag existing-tag --force

# Create packages without uploading to GitHub (local testing)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --skip-upload

# Validate data before release
python -m src.cli release validate --staging-dir data/staging --exports-dir exports
```

### Release Management
```bash
# Check release status
python -m src.cli release status --repo-owner your-org --repo-name oevk-data --tag 20250101-1200

# List recent releases
python -m src.cli release history --repo-owner your-org --repo-name oevk-data --limit 10

# Get detailed release information
python -m src.cli release info --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
```

## Environment Setup

### Required Environment Variables
```bash
# GitHub Personal Access Token (required for releases)
export GITHUB_TOKEN="ghp_your_token_here"

# Optional: Custom directories
export STAGING_DIR="/path/to/staging"
export EXPORTS_DIR="/path/to/exports"
```

### GitHub Token Permissions
- **repo** (full repository access) - **REQUIRED**
- **workflow** (if using GitHub Actions)
- **read:org** (if accessing organization repositories)

**IMPORTANT**: For organization repositories, use **classic personal access tokens** instead of fine-grained tokens. Classic tokens have better organization repository upload permissions.

### Prerequisites
- **Python 3.11+** with required dependencies
- **GitHub CLI (gh)** installed and authenticated
- **GitHub Personal Access Token** with appropriate permissions
- **Data processing pipeline** completed (staging and exports directories populated)

## Troubleshooting

### Common Issues

#### GitHub Authentication
```bash
# Verify GitHub CLI authentication
gh auth status

# Login if needed
gh auth login

# Set token explicitly
gh auth login --with-token <<< "$GITHUB_TOKEN"

# For organization repositories, use classic tokens instead of fine-grained tokens:
# 1. Go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic)
# 2. Create a new classic token with "repo" scope
# 3. Use the classic token (starts with "gho_") instead of fine-grained tokens (start with "github_pat_")
```

#### Release Creation Failures
```bash
# Check if release already exists
gh release view 20250101-1200 --repo your-org/oevk-data

# Delete existing release if needed
gh release delete 20250101-1200 --repo your-org/oevk-data --yes

# Force recreate release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag 20250101-1200 --force

# Test GitHub connection and permissions
gh auth status
gh api repos/your-org/oevk-data
```

#### Data Validation Issues
```bash
# Run validation with verbose output
python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports

# Check file permissions
ls -la data/staging/
ls -la exports/

# Verify file contents
head -n 5 data/staging/addresses.csv
head -n 5 exports/addresses.csv
```

#### Performance Issues
```bash
# Run performance tests
python -m pytest tests/performance/ -v

# Monitor system resources during release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --monitor

# Check disk space
df -h
du -sh data/staging/ exports/
```

### Debug Mode
```bash
# Enable debug logging
export LOG_LEVEL="DEBUG"
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto

# Enable verbose logging with --verbose flag
python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports

# Dry run (validate without creating release)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --dry-run

# Test with explicit GitHub token
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --github-token $GITHUB_TOKEN
```

### Organization Repository Troubleshooting

#### Common Organization Repository Issues

**Error**: `HTTP 403: Resource not accessible by personal access token`

**Cause**: Fine-grained personal access tokens have upload permission issues with organization repositories

**Solution**: Use classic personal access tokens instead of fine-grained tokens

```bash
# 1. Regenerate GitHub token as classic token
# Go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic)
# Create new classic token with these scopes:
# - repo (full repository access)
# - workflow (if using GitHub Actions)
# - read:org (organization read access)

# 2. Update environment variable
export GITHUB_TOKEN="your_classic_token_here"  # Should start with "gho_"

# 3. Test repository access
gh api repos/your-org/your-repo

# 4. Test release creation and upload
gh release create test-release --repo your-org/your-repo --title "Test" --notes "Testing"
echo "test file" > test_upload.txt
gh release upload test-release test_upload.txt --repo your-org/your-repo
```

#### Token Type Identification
```bash
# Check token type
export GITHUB_TOKEN="your_token_here"
gh auth status --show-token

# Fine-grained tokens start with: github_pat_...
# Classic tokens start with: gho_...
```

#### Skip-Upload Workaround

If organization permissions can't be resolved immediately, use skip-upload:
```bash
# Create packages without uploading to GitHub
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --skip-upload

# Packages will be saved to data/temp/ for manual upload
ls -la data/temp/

# Manual upload to GitHub release
gh release upload 20250101-1200 data/temp/oevk-data-csv-20250101-1200.zip data/temp/oevk-data-db-20250101-1200.zip --repo your-org/oevk-data
```

#### Verify Organization Access
```bash
# Check current token scopes
gh auth status

# Test repository permissions
gh api repos/your-org/your-repo | jq '.permissions'

# Expected output should include:
# {
#   "admin": true,
#   "maintain": true,
#   "push": true,
#   "triage": true,
#   "pull": true
# }
```

#### Skip-Upload Workaround

If organization permissions can't be resolved immediately, use skip-upload:
```bash
# Create packages without uploading to GitHub
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --skip-upload

# Packages will be saved to data/temp/ for manual upload
ls -la data/temp/

# Manual upload to GitHub release
gh release upload 20250101-1200 data/temp/oevk-data-csv-20250101-1200.zip data/temp/oevk-data-db-20250101-1200.zip --repo your-org/oevk-data
```

#### Verify Organization Access
```bash
# Check current token scopes
gh auth status

# Test repository permissions
gh api repos/your-org/your-repo | jq '.permissions'

# Expected output should include:
# {
#   "admin": true,
#   "maintain": true,
#   "push": true,
#   "triage": true,
#   "pull": true
# }
```

## Performance Monitoring

### Release Performance Commands
```bash
# Run performance benchmarks
python -m pytest tests/performance/test_release_performance.py -v

# Monitor release timing
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --timing

# Generate performance report
python -m src.cli release performance --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
```

### Expected Performance Metrics
- **Total Release Time**: ≤15 minutes
- **Data Validation**: ≤2 minutes  
- **Package Creation**: ≤5 minutes
- **GitHub Operations**: ≤3 minutes
- **File Compression**: ≤2 minutes

### Resource Requirements
- **Memory**: ≥4GB RAM for large datasets
- **Disk Space**: ≥2GB free space for temporary files
- **Network**: Stable internet connection for GitHub operations
- **CPU**: Multi-core processor for parallel operations