---
name: bigquery-etl-core
description: The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
---

# BigQuery ETL Core

**Composable:** Foundation skill that works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills
**When to use:** Understanding project structure, conventions, common patterns, and finding schema descriptions for construction

## Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:
- `query.sql` OR `query.py` - The query definition (SQL or Python)
- `metadata.yaml` - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
- `schema.yaml` - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

**Note:** Most tables use `query.sql` (~95%). Use `query.py` for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

## 🚨 REQUIRED READING - Start Here

**When starting work in bigquery-etl, READ these foundational references:**

1. **Naming Conventions:** READ `references/naming_conventions.md`
   - Table naming patterns
   - Dataset organization
   - Version suffix conventions

2. **Dataset Organization:** READ `references/dataset_naming_conventions.md`
   - Common dataset suffixes (_derived, _stable, _live)
   - When to use each dataset type
   - Dataset naming rules

3. **Schema Resources:** READ `references/discovery_resources.md`
   - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
   - Priority order for schema lookup during construction
   - Common mozfun UDFs

4. **Privacy Guidelines:** READ `references/privacy_guidelines.md`
   - Data handling requirements
   - PII considerations
   - Workgroup access patterns

## Directory Structure

```
sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml
```

See `assets/directory_structure_example.txt` for detailed examples.

**Key principles:**
- Always flat: `sql/{project}/{dataset}/{table_name}/`
- Never use subdirectories within table directories
- Table names always include version suffix (`_v1`, `_v2`, etc.)

## Schema & Description Resources for Construction

### Finding Schema Descriptions

**Priority order for schema lookup during construction:**

1. **Local files first:** Check `sql/*/schema.yaml` and `metadata.yaml` files
   - Most reliable and up-to-date source
   - Contains field descriptions written by table owners

2. **Glean Dictionary:** For `_live` and `_stable` tables
   - URL: https://dictionary.telemetry.mozilla.org/
   - Contains metric descriptions from Glean schema definitions
   - Use WebFetch with targeted prompts to extract specific field descriptions

3. **ProbeInfo API:** For Glean metric metadata
   - Endpoints: `https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics`
   - Provides metric definitions and descriptions programmatically
   - Use for validating metric references in queries

4. **DataHub MCP:** Only as last resort
   - **MUST READ `references/datahub_best_practices.md` BEFORE any DataHub queries**
   - Use for schema lookup when not available in local files or Glean Dictionary
   - Extract ONLY necessary fields (column names, types, descriptions)
   - Use for downstream impact analysis when modifying tables

**See `references/discovery_resources.md` for:**
- Detailed guidance on each schema source
- ProbeInfo API endpoints and usage patterns
- Glean Dictionary URL patterns for different products
- DataHub MCP best practices for construction
- Common mozfun UDFs
- Key documentation links

## Naming Conventions

**Table Names:**
- Use snake_case with version suffix: `clients_daily_event_v1`
- Common suffixes: `_daily`, `_hourly`, `_aggregates`, `_summary`

**Field Names:**
- Use snake_case: `submission_date`, `client_id`, `n_total_events`
- Prefix counts with `n_`: `n_events`, `n_sessions`
- Standard Mozilla fields: `submission_date`, `client_id`, `sample_id`, `normalized_channel`, `normalized_country_code`, `app_version`

**See `references/naming_conventions.md` for:**
- Complete naming patterns and conventions
- Reserved/common patterns to avoid
- BigQuery project naming conventions

## Dataset Organization

**See `references/dataset_naming_conventions.md` for:**
- Dataset naming patterns by suffix (`_derived`, `_external`, etc.)
- Common dataset prefixes by product/source
- Table versioning patterns
- Incremental vs full refresh query patterns

## Privacy & Data Handling

Mozilla follows strict data privacy policies:
- No PII in derived tables
- Use client-level identifiers (`client_id`) not individual identifiers
- Respect data retention policies (~2 years for client-level data)
- Label client-level tables with `table_type: client_level` in metadata.yaml

**See `references/privacy_guidelines.md` for:**
- Key principles from Mozilla's data platform
- Geo IP lookup and user agent parsing policies
- Best practices for data handling
- Deletion request support
- Sample ID usage for sampling

## BigQuery & Mozilla Conventions

### Partitioning & Clustering

- Most tables use day partitioning on `submission_date`
- Clustering improves query performance for filtered/joined fields
- See metadata-manager skill for detailed partitioning and clustering configuration

### Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:
- `mozfun.map.get_key()` - Extract values from key-value maps
- `mozfun.norm.truncate_version()` - Normalize version strings
- `mozfun.stats.mode_last()` - Statistical mode calculation

UDF source code in `sql/mozfun/` directory.

## Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

**Key concepts:**
- **Metric types**: Counter, boolean, string, event, etc.
- **Pings**: Collections of metrics (e.g., `baseline`, `events`, `metrics`)
- **Applications**: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

**Common Glean datasets in BigQuery:**
- Pattern: `{app_id}.{ping_name}` (e.g., `org_mozilla_fenix.baseline`)
- All have auto-generated schemas based on metric definitions

**See `references/glean_overview.md` for:**
- What is Glean and how it differs from Firefox Desktop Telemetry
- Glean SDK and metric type details
- Common Glean datasets in BigQuery
- When to use Glean Dictionary

## bigquery-etl CLI Commands

**See `references/bqetl_cli_commands.md` for:**
- Key bqetl CLI commands for query creation, validation, schema updates
- How to find the right DAG for scheduling
- Backfill creation commands

## Best Practices

**General principles:**
- Always include field descriptions in schema.yaml (see metadata-manager skill)
- Add header comments explaining query purpose (see query-writer skill)
- Reference bug/ticket numbers for context
- Document any data exclusions or filtering logic

**See `assets/query_structure_example.sql` for standard query structure.**

**Version migration:**
- Create new `_v2` table when making breaking schema changes
- Keep `_v1` running during migration period
- Update views to point to new version
- Coordinate with downstream consumers before deprecating old version

**For detailed best practices, see:**
- Query writing: query-writer skill
- Metadata configuration: metadata-manager skill
- Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
- Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/

## Integration with Other Skills

**bigquery-etl-core** serves as the foundation skill that other skills build upon:

### Works with model-requirements
- Provides naming conventions for new tables and datasets
- Supplies common field naming patterns for requirements gathering
- Offers privacy guidelines for data model planning

### Works with query-writer
- Provides project structure and naming conventions
- Supplies common patterns and mozfun UDF references
- Offers schema description lookup guidance for construction

### Works with metadata-manager
- Provides DAG naming patterns and scheduling conventions
- Supplies partitioning and clustering best practices
- Offers ownership and labeling patterns

### Works with sql-test-generator
- Provides test structure and fixture naming conventions
- Supplies common table patterns for test creation
- Offers query parameter conventions

### Works with bigconfig-generator
- Provides table naming conventions for Bigeye monitoring configuration
- Supplies dataset organization patterns
- Offers field naming standards for data quality checks

**This skill is always available and does not need to be explicitly invoked** - it provides foundational knowledge that other skills reference.

## Reference Examples

Real query examples in the repository:
- **Simple query**: `sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql`
- **Aggregation with GROUP BY**: `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql`
- **Complex query with CTEs**: `sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql`
- **Python ETL (INFORMATION_SCHEMA)**: `sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py`
- **Python ETL (External API)**: `sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py`

For more examples, explore the `sql/moz-fx-data-shared-prod/` directory.

## Bundled Resources

### References
- `references/discovery_resources.md` - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
- `references/naming_conventions.md` - Complete naming patterns for tables, fields, and projects
- `references/dataset_naming_conventions.md` - Dataset organization and versioning patterns
- `references/privacy_guidelines.md` - Mozilla data privacy policies and best practices
- `references/glean_overview.md` - Glean SDK concepts and BigQuery dataset structures
- `references/bqetl_cli_commands.md` - Key CLI commands and DAG discovery

### DataHub Usage (CRITICAL for Token Efficiency)

**BEFORE using any DataHub MCP tools (`mcp__datahub-cloud__*`), you MUST:**
- **READ `references/datahub_best_practices.md`** - Comprehensive token optimization strategies
- Follow priority order: local files → documentation → DataHub (only as last resort)
- Use search-first patterns and extract minimal fields from responses

### Assets
- `assets/query_structure_example.sql` - Standard query.sql structure with common patterns
- `assets/directory_structure_example.txt` - File organization examples