---
name: aggregating-event-datasets
description: Aggregate and summarize event datasets (logs) using OPAL statsby. Use when you need to count, sum, or calculate statistics across log events. Covers make_col for derived columns, statsby for aggregation, group_by for grouping, aggregation functions (count, sum, avg, percentile), and topk for top N results. Returns a single summary row per group across the entire time range. For time-series trends, see the time-series-analysis skill.
---

# Aggregating Event Datasets

Event datasets (logs) can be aggregated to create summaries and statistics. This skill teaches you how to use `statsby` to aggregate log data into meaningful insights using OPAL.

## When to Use This Skill

- Counting occurrences (error count by namespace, log volume by pod)
- Calculating statistics (average, sum, percentiles) across events
- Grouping events by dimensions (namespace, pod, container, service)
- Finding top N results by a metric (top 10 error sources, busiest pods)
- Creating summary reports across the entire time range

**Note**: This skill covers `statsby`, which returns **one summary row per group** across the entire time range. For time-series trends (multiple rows per group over time), see the `time-series-analysis` skill.

## Prerequisites

- Access to an Observe tenant via MCP
- Understanding of event datasets (see the filtering-event-datasets skill)
- A dataset with the `log` interface (or any Event dataset)

## Key Concepts

### statsby - Statistical Aggregation

`statsby` is the primary aggregation verb for event datasets. It:

- Groups events by specified dimensions
- Applies aggregation functions (count, sum, avg, etc.)
- Returns **one row per group** across the entire query time range

**Syntax**:

```opal
statsby aggregation_function(), group_by(dimension1, dimension2, ...)
```

### Common Aggregation Functions

- `count()` - Count the number of events
- `sum(field)` - Sum the values of a field
- `avg(field)` - Average value of a field
- `min(field)` - Minimum value
- `max(field)` - Maximum value
- `percentile(field, p)` - Percentile (e.g., p=0.95 for the 95th percentile)
- `any_not_null(field)` - Any non-null value from the group

### topk vs sort/limit

- **`topk N, max(metric)`** - Get the top N results by a specific metric (semantically correct for "top performers")
- **`sort desc(metric) | limit N`** - Works, but the intent is less clear
- **Use topk** for aggregated results - it is more explicit about intent

## Discovery Workflow

Start with dataset discovery (same as filtering-event-datasets):

**Step 1: Find the dataset**

```
discover_context("kubernetes logs")
```

**Step 2: Get the schema**

```
discover_context(dataset_id="YOUR_DATASET_ID")
```

Note the fields you'll use for:

- **Filtering** (before aggregation)
- **Grouping** (dimensions to aggregate by)
- **Calculating** (fields to sum, average, etc.)

## Basic Patterns

### Pattern 1: Simple Count

**Use case**: Count total events

```opal
statsby count()
```

**Explanation**: Counts all events in the time range. Returns a single row with the total count.

**Output**:

```
count
5831
```

### Pattern 2: Count by Dimension

**Use case**: Count events grouped by a field (e.g., namespace)

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby count(), group_by(namespace)
| topk 10, max(count)
```

**Explanation**:

1. `make_col` creates a derived column `namespace` from a nested field
2. `statsby` counts events, grouped by namespace
3. `topk` returns the top 10 namespaces by count

**Output**:

```
namespace,count,_c_rank
default,5805,1
kube-system,648,2
observe,64,3
```

### Pattern 3: Count with Filtering

**Use case**: Count errors per namespace

```opal
filter contains(body, "error")
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby error_count:count(), group_by(namespace)
| topk 10, max(error_count)
```

**Explanation**: Filters for errors first, then counts by namespace. Notice we name the count `error_count` for clarity.

### Pattern 4: Multiple Dimensions

**Use case**: Count by namespace AND pod

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name"), pod:pod
| statsby count(), group_by(namespace, pod)
| topk 20, max(count)
```

**Explanation**: Groups by multiple dimensions. Each unique (namespace, pod) combination gets one row.

### Pattern 5: Multiple Aggregations

**Use case**: Calculate multiple statistics in one query

```opal
filter stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby stderr_count:count(), sample_pod:any_not_null(pod), group_by(namespace)
| topk 10, max(stderr_count)
```

**Explanation**: List multiple aggregations, comma-separated, in a single `statsby` call; here `count()` and `any_not_null()` are computed together for each namespace.

## Complete Example

End-to-end workflow for analyzing errors across your infrastructure.

**Scenario**: Find which services, namespaces, and pods are producing the most errors in the last 24 hours.
**Step 1: Discovery**

```
discover_context("kubernetes logs")
```

Found: Dataset "Kubernetes Explorer/Kubernetes Logs" (ID: 42161740)

**Step 2: Build the query**

```opal
filter contains(body, "error") or contains(body, "ERROR")
| make_col namespace:string(resource_attributes."k8s.namespace.name"), pod:pod, container:container
| statsby error_count:count(), group_by(namespace, pod, container)
| topk 20, max(error_count)
```

**Step 3: Execute**

```
execute_opal_query(
  query="[query above]",
  primary_dataset_id="42161740",
  time_range="24h"
)
```

**Step 4: Interpret the results**

```csv
namespace,pod,container,error_count,_c_rank
kube-system,calico-node-74d4r,calico-node,33,1
kube-system,calico-node-hhvbf,calico-node,31,2
kube-system,calico-node-ghk2s,calico-node,31,3
kube-system,calico-kube-controllers-759cd8b574-fzr49,calico-kube-controllers,31,4
```

**Analysis**:

- Most errors are in the `kube-system` namespace
- `calico-node` pods are the primary error source
- All errors come from the same container (`calico-node`)
- 126 errors in total across the top 4 sources in 24h

**Next steps**: Investigate the specific calico-node errors to understand the root cause.

## Advanced Patterns

### Pattern 6: Conditional Aggregation

**Use case**: Count errors vs total, calculate error rate

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name"),
  is_error:if(contains(body, "error"), 1, 0)
| statsby total:count(), error_count:sum(is_error), group_by(namespace)
| make_col error_rate:float64(error_count)/float64(total)
| topk 10, max(error_rate)
```

**Explanation**:

1. Create a 0/1 flag `is_error`
2. Count total events and sum the error flags
3. Calculate the error rate as a derived column
4. Show the top 10 by error rate

**Note**: OPAL doesn't have `count_if()`, so use the `if()` + `sum()` pattern.
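The `if()` + `sum()` pattern generalizes to several conditional counts in one pass. A hedged sketch against the same Kubernetes Logs dataset — the `"warn"` substring match is an assumption about your log format, so adjust it to whatever your logs actually emit:

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name"),
  is_error:if(contains(body, "error"), 1, 0),
  is_warn:if(contains(body, "warn"), 1, 0)
| statsby total:count(),
  error_count:sum(is_error),
  warn_count:sum(is_warn),
  group_by(namespace)
| topk 10, max(error_count)
```

Each extra condition costs one flag column in `make_col` and one `sum()` in `statsby`; all counts are computed in a single pass over the events.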
### Pattern 7: Type Conversions

**Use case**: Safely handle type conversions for nested fields

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name"),
  pod:string(pod),
  container:string(container)
| statsby count(), group_by(namespace, pod, container)
| topk 20, max(count)
```

**Explanation**: Wrap fields in `string()`, `int64()`, or `float64()` for type safety, especially with nested fields.

## Common Pitfalls

### Pitfall 1: Forgetting make_col Before statsby

❌ **Wrong**:

```opal
statsby count(), group_by(resource_attributes."k8s.namespace.name")
# Error: can't group by a nested field directly
```

✅ **Correct**:

```opal
make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby count(), group_by(namespace)
```

**Why**: `group_by` needs simple column names. Use `make_col` to extract nested fields first.

### Pitfall 2: Using align Instead of statsby

❌ **Wrong**:

```opal
align options(bins: 1), count:count()
aggregate total:sum(count)
# align is for METRICS only!
```

✅ **Correct**:

```opal
statsby count()
# statsby is for EVENTS
```

**Why**: `align` is only for metric datasets. Events use `statsby` for aggregation.

### Pitfall 3: Using limit Instead of topk After Aggregation

❌ **Wrong** (less clear):

```opal
statsby error_count:count(), group_by(namespace)
| sort desc(error_count)
| limit 10
```

✅ **Correct**:

```opal
statsby error_count:count(), group_by(namespace)
| topk 10, max(error_count)
```

**Why**: `topk` explicitly states "top N by this metric" - clearer intent than an arbitrary limit.
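Note that `topk` can rank by any aggregate, not just counts. A hedged sketch ranking services by 95th-percentile latency — `duration_ms` and the `service.name` resource attribute are hypothetical field names, so confirm them against your schema with `discover_context` first:

```opal
make_col service:string(resource_attributes."service.name"),
  duration_ms:float64(duration_ms)
| statsby avg_ms:avg(duration_ms),
  p95_ms:percentile(duration_ms, 0.95),
  max_ms:max(duration_ms),
  group_by(service)
| topk 10, max(p95_ms)
```

The same `topk 10, max(...)` shape works here because each group has exactly one `p95_ms` value after `statsby`.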
### Pitfall 4: Confusing statsby with timechart

❌ **Wrong** (if you want a summary):

```opal
timechart 1h, count(), group_by(namespace)
# Returns multiple rows per namespace (time-series)
```

✅ **Correct** (for a summary):

```opal
statsby count(), group_by(namespace)
# Returns one row per namespace (total)
```

**Why**:

- `statsby` = single summary across the time range
- `timechart` = time-series with multiple rows per group

## Tips and Best Practices

- **Name your aggregations**: Use descriptive names like `error_count:count()` instead of a bare `count()`
- **Filter before aggregating**: Apply filters before `statsby` for better performance
- **Use topk for top N**: More explicit than sort/limit
- **Type conversion**: Wrap nested fields in `string()` for safety
- **Test with limit first**: When developing, filter to a small dataset before aggregating
- **Small time ranges**: Start with 1h or 24h, and expand once the query works

## Aggregation Function Reference

**Counting**:

- `count()` - Count all events in the group

**Numeric**:

- `sum(field)` - Sum values
- `avg(field)` - Average
- `min(field)` - Minimum
- `max(field)` - Maximum
- `percentile(field, p)` - Percentile (p from 0.0 to 1.0)

**String/Any**:

- `any_not_null(field)` - Any non-null value from the group

## Additional Resources

For more details, see:

- [RESEARCH.md](../../RESEARCH.md) - Tested patterns and findings
- [OPAL Documentation](https://docs.observeinc.com/en/latest/content/query-language-reference/) - Official OPAL docs

## Related Skills

- [filtering-event-datasets] - For filtering events before aggregation
- [time-series-analysis] - For time-series trends with timechart
- [working-with-nested-fields] - Deep dive on nested field access

---

**Last Updated**: November 14, 2025
**Version**: 1.0
**Tested With**: Observe OPAL v2.x