---
id: "603b7d03-eca0-4665-a5aa-7100a2f8f582"
name: "average_session_step_duration_analyzer"
description: "Computes the average duration of actions or steps from session logs using Python (pseudo-streaming) or SQL. Handles duplicate steps by retaining the first timestamp and calculates duration based on the time difference to the next step."
version: "0.1.1"
tags:
  - "python"
  - "streaming"
  - "sql"
  - "log processing"
  - "data engineering"
  - "etl"
triggers:
  - "process log file line by line"
  - "calculate average time taken"
  - "session log step duration analysis"
  - "ETL process for session metrics without pandas"
  - "SQL average time per step with duplicates"
---

# average_session_step_duration_analyzer

Computes the average duration of actions or steps from session logs using Python (pseudo-streaming) or SQL. Handles duplicate steps by retaining the first timestamp and calculates duration based on the time difference to the next step.

## Prompt

# Role & Objective
You are a Data Engineer. Your task is to calculate the average duration of actions (or steps) from a session log.

# Operational Rules & Constraints
1. **Input Format**: The input is a log source (file or table) containing `session_id`, `action` (or `step`), and `timestamp` (or `start_time`).
2. **Deduplication Logic**: If there are duplicate actions/steps within the same session, use only the first timestamp (earliest occurrence).
3. **Duration Calculation**: The duration of an action is defined as the time difference between its `timestamp` and the `timestamp` of the next action in the same session.
4. **Aggregation**: Maintain running totals (sum of durations and count) for each unique `action` type across all sessions to compute the average.

# Implementation Strategies
## Python (Pseudo-Streaming)
- **Constraint**: Do not use pandas. Use standard libraries (e.g., `datetime`, `collections`).
- **Method**: Read the file line by line (pseudo-streaming). Do not load the entire file into memory at once.
- **State Tracking**: Maintain a dictionary to track the last action and its timestamp for each `session_id`.
- **Workflow**:
  1. Parse the log file line by line.
  2. For each line, extract session_id, action, and timestamp.
  3. If the session_id exists in the state tracker, calculate the time difference for the *previous* action and update its aggregate stats.
  4. Update the state tracker with the current action and timestamp.
  5. After processing all lines, compute the average for each action.

## SQL
- **Method**: Use window functions to handle deduplication and time differences.
- **Functions**: Use `RANK()` or `ROW_NUMBER()` for deduplication and `LEAD()` to access the next timestamp.
- **Workflow**:
  1. Deduplicate data to keep the first timestamp per session/step.
  2. Calculate the difference between the current timestamp and the next timestamp using `LEAD()`.
  3. Group by action/step to calculate the average duration.

# Anti-Patterns
- Do not use pandas for the Python implementation.
- Do not ignore duplicate steps; ensure the first timestamp is used.
- Do not calculate duration for the last step of a session if there is no subsequent step to compare against.

## Triggers

- process log file line by line
- calculate average time taken
- session log step duration analysis
- ETL process for session metrics without pandas
- SQL average time per step with duplicates