--- name: window-function-builder description: Эксперт SQL window functions. Используй для аналитических запросов, OVER clauses и ranking functions. --- # SQL Window Function Builder Эксперт по SQL window functions и аналитическим запросам для бизнес-аналитики. ## Core Function Categories ### Function Overview ```yaml window_functions: ranking: - ROW_NUMBER(): "Unique sequential numbers" - RANK(): "Ranking with gaps for ties" - DENSE_RANK(): "Ranking without gaps" - NTILE(n): "Divide into n buckets" aggregate: - SUM(): "Running/cumulative totals" - AVG(): "Moving averages" - COUNT(): "Running counts" - MIN(): "Running minimum" - MAX(): "Running maximum" offset: - LAG(col, n): "Access previous row" - LEAD(col, n): "Access next row" - FIRST_VALUE(): "First value in frame" - LAST_VALUE(): "Last value in frame" - NTH_VALUE(col, n): "Nth value in frame" statistical: - PERCENT_RANK(): "Relative rank as percentage" - CUME_DIST(): "Cumulative distribution" - PERCENTILE_CONT(): "Interpolated percentile" - PERCENTILE_DISC(): "Discrete percentile" ``` --- ## Basic Syntax ### OVER Clause Structure ```sql function_name(expression) OVER ( [PARTITION BY partition_expression, ...] [ORDER BY sort_expression [ASC|DESC], ...] [frame_clause] ) ``` ### Frame Specifications ```sql -- Frame clause syntax ROWS | RANGE BETWEEN frame_start AND frame_end -- Frame bounds UNBOUNDED PRECEDING -- From first row of partition n PRECEDING -- n rows before current CURRENT ROW -- Current row n FOLLOWING -- n rows after current UNBOUNDED FOLLOWING -- To last row of partition -- Common frame patterns ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- Running total (default for ORDER BY) ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 7-day rolling ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- 3-row centered RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW -- 30-day range ``` --- ## Ranking Functions ### ROW_NUMBER, RANK, DENSE_RANK ```sql -- Basic ranking comparison SELECT employee_id, department, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num, RANK() OVER (ORDER BY salary DESC) AS rank, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank FROM employees; -- Results with tied salaries: -- | employee_id | salary | row_num | rank | dense_rank | -- |-------------|--------|---------|------|------------| -- | 101 | 100000 | 1 | 1 | 1 | -- | 102 | 100000 | 2 | 1 | 1 | <- same salary -- | 103 | 90000 | 3 | 3 | 2 | <- note rank skips 2 ``` ### Ranking Within Groups ```sql -- Top 3 salaries per department WITH ranked AS ( SELECT employee_id, employee_name, department, salary, DENSE_RANK() OVER ( PARTITION BY department ORDER BY salary DESC ) AS dept_rank FROM employees ) SELECT * FROM ranked WHERE dept_rank <= 3; ``` ### NTILE for Percentiles ```sql -- Divide customers into quartiles by revenue SELECT customer_id, customer_name, total_revenue, NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile, CASE NTILE(4) OVER (ORDER BY total_revenue DESC) WHEN 1 THEN 'Top 25%' WHEN 2 THEN '25-50%' WHEN 3 THEN '50-75%' WHEN 4 THEN 'Bottom 25%' END AS segment FROM customer_revenue; ``` --- ## Running Calculations ### Running Totals ```sql -- Running total of sales SELECT order_date, order_amount, SUM(order_amount) OVER ( ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total FROM orders; -- Running total per customer SELECT customer_id, order_date, order_amount, SUM(order_amount) OVER ( PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS customer_running_total FROM orders; ``` ### Moving Averages ```sql -- 7-day moving average SELECT date, daily_revenue, AVG(daily_revenue) OVER ( ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS moving_avg_7d, -- Also track the count for partial windows COUNT(*) OVER ( ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS days_in_window FROM daily_sales; -- Centered moving average (3 days before, current, 3 days after) SELECT date, daily_revenue, AVG(daily_revenue) OVER ( ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING ) AS centered_avg_7d FROM daily_sales; ``` ### Cumulative Percentage ```sql -- Cumulative percentage of total SELECT product_category, revenue, SUM(revenue) OVER (ORDER BY revenue DESC) AS cumulative_revenue, ROUND( 100.0 * SUM(revenue) OVER (ORDER BY revenue DESC) / SUM(revenue) OVER (), 2 ) AS cumulative_pct FROM category_sales ORDER BY revenue DESC; ``` --- ## LAG and LEAD ### Period-over-Period Comparison ```sql -- Month-over-month growth SELECT month, revenue, LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue, revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change, ROUND( 100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY month)) / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0), 2 ) AS mom_growth_pct FROM monthly_revenue; -- Year-over-year comparison SELECT date, revenue, LAG(revenue, 365) OVER (ORDER BY date) AS yoy_revenue, revenue - LAG(revenue, 365) OVER (ORDER BY date) AS yoy_change FROM daily_revenue; ``` ### Gap Analysis ```sql -- Find gaps between orders SELECT customer_id, order_date, LAG(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS previous_order_date, order_date - LAG(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS days_since_last_order FROM orders; -- Identify churned customers (no order > 90 days) WITH order_gaps AS ( SELECT customer_id, order_date, LEAD(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS next_order_date, LEAD(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) - order_date AS gap_days FROM orders ) SELECT DISTINCT customer_id FROM order_gaps WHERE gap_days > 90 OR (next_order_date IS NULL AND order_date < CURRENT_DATE - INTERVAL '90 days'); ``` ### Lead for Future Values ```sql -- Forecast vs actual comparison SELECT forecast_date, predicted_value, LEAD(actual_value, 7) OVER (ORDER BY forecast_date) AS actual_7d_later, LEAD(actual_value, 7) OVER (ORDER BY forecast_date) - predicted_value AS forecast_error FROM forecasts; ``` --- ## FIRST_VALUE and LAST_VALUE ### First/Last in Group ```sql -- First and last order per customer SELECT customer_id, order_id, order_date, order_amount, FIRST_VALUE(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS first_order_date, LAST_VALUE(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS last_order_date FROM orders; -- Note: LAST_VALUE requires explicit frame to see all rows! ``` ### Baseline Comparison ```sql -- Compare each day to first day of month SELECT date, revenue, FIRST_VALUE(revenue) OVER ( PARTITION BY DATE_TRUNC('month', date) ORDER BY date ) AS first_day_revenue, revenue - FIRST_VALUE(revenue) OVER ( PARTITION BY DATE_TRUNC('month', date) ORDER BY date ) AS diff_from_first_day FROM daily_revenue; ``` --- ## Statistical Functions ### Percentile Calculations ```sql -- Calculate percentiles SELECT product_id, price, PERCENT_RANK() OVER (ORDER BY price) AS percent_rank, CUME_DIST() OVER (ORDER BY price) AS cumulative_dist, NTILE(100) OVER (ORDER BY price) AS percentile FROM products; -- Median calculation (50th percentile) SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary_discrete, AVG(salary) AS mean_salary FROM employees; -- Multiple percentiles at once SELECT department, PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25, PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75, PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90 FROM employees GROUP BY department; ``` --- ## Business Analytics Patterns ### Customer Lifecycle Analysis ```sql -- Customer order sequence and lifecycle metrics WITH customer_orders AS ( SELECT customer_id, order_id, order_date, order_amount, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date ) AS order_number, FIRST_VALUE(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS first_order_date, SUM(order_amount) OVER ( PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS lifetime_value, LAG(order_date) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS previous_order_date FROM orders ) SELECT customer_id, order_id, order_number, order_amount, lifetime_value, order_date - first_order_date AS days_since_first_order, order_date - previous_order_date AS days_since_last_order, CASE WHEN order_number = 1 THEN 'New' WHEN order_date - previous_order_date > 90 THEN 'Reactivated' ELSE 'Repeat' END AS customer_status FROM customer_orders; ``` ### Cohort Analysis ```sql -- Monthly cohort retention WITH user_cohorts AS ( SELECT user_id, DATE_TRUNC('month', first_activity_date) AS cohort_month, DATE_TRUNC('month', activity_date) AS activity_month FROM user_activity ), cohort_sizes AS ( SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size FROM user_cohorts GROUP BY cohort_month ), monthly_activity AS ( SELECT cohort_month, activity_month, COUNT(DISTINCT user_id) AS active_users, EXTRACT(MONTH FROM AGE(activity_month, cohort_month)) AS month_number FROM user_cohorts GROUP BY cohort_month, activity_month ) SELECT ma.cohort_month, ma.month_number, cs.cohort_size, ma.active_users, ROUND(100.0 * ma.active_users / cs.cohort_size, 2) AS retention_rate FROM monthly_activity ma JOIN cohort_sizes cs ON ma.cohort_month = cs.cohort_month ORDER BY ma.cohort_month, ma.month_number; ``` ### Sales Performance ```sql -- Sales rep performance with rankings and targets WITH sales_performance AS ( SELECT sales_rep_id, rep_name, region, SUM(deal_amount) AS total_sales, COUNT(*) AS deal_count, AVG(deal_amount) AS avg_deal_size FROM sales_deals WHERE close_date >= DATE_TRUNC('quarter', CURRENT_DATE) GROUP BY sales_rep_id, rep_name, region ) SELECT sales_rep_id, rep_name, region, total_sales, deal_count, avg_deal_size, RANK() OVER (ORDER BY total_sales DESC) AS overall_rank, RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS region_rank, total_sales - AVG(total_sales) OVER () AS vs_company_avg, total_sales - AVG(total_sales) OVER (PARTITION BY region) AS vs_region_avg, PERCENT_RANK() OVER (ORDER BY total_sales) AS percentile FROM sales_performance; ``` ### Inventory Analysis ```sql -- Inventory movement analysis SELECT product_id, transaction_date, transaction_type, quantity, SUM( CASE WHEN transaction_type = 'IN' THEN quantity WHEN transaction_type = 'OUT' THEN -quantity ELSE 0 END ) OVER ( PARTITION BY product_id ORDER BY transaction_date, transaction_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_inventory, AVG( CASE WHEN transaction_type = 'OUT' THEN quantity END ) OVER ( PARTITION BY product_id ORDER BY transaction_date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW ) AS avg_daily_usage_30d FROM inventory_transactions; ``` --- ## Platform-Specific Features ### PostgreSQL ```sql -- FILTER clause with window functions SELECT date, category, amount, SUM(amount) OVER (ORDER BY date) AS total_running, SUM(amount) FILTER (WHERE category = 'A') OVER (ORDER BY date) AS category_a_running FROM transactions; -- GROUPS frame type SELECT date, amount, SUM(amount) OVER ( ORDER BY date GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) AS sum_adjacent_groups FROM transactions; ``` ### SQL Server ```sql -- STRING_AGG with OVER (SQL Server 2017+) -- Not directly supported, use workaround: SELECT DISTINCT department, STRING_AGG(employee_name, ', ') WITHIN GROUP (ORDER BY employee_name) OVER (PARTITION BY department) AS employees FROM employees; -- OFFSET with frames (SQL Server 2022+) SELECT date, value, AVG(value) OVER ( ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING ) AS previous_week_avg FROM daily_metrics; ``` ### BigQuery ```sql -- QUALIFY clause (filter on window function results) SELECT customer_id, order_date, order_amount FROM orders QUALIFY ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) = 1; -- Equivalent to CTE approach: -- WITH ranked AS ( -- SELECT *, ROW_NUMBER() OVER (...) AS rn FROM orders -- ) -- SELECT * FROM ranked WHERE rn = 1; ``` ### Snowflake ```sql -- QUALIFY clause SELECT * FROM sales QUALIFY DENSE_RANK() OVER ( PARTITION BY region ORDER BY revenue DESC ) <= 5; -- CONDITIONAL_TRUE_EVENT for sessionization SELECT user_id, event_time, CONDITIONAL_TRUE_EVENT( DATEDIFF('minute', LAG(event_time) OVER ( PARTITION BY user_id ORDER BY event_time ), event_time) > 30 ) OVER ( PARTITION BY user_id ORDER BY event_time ) AS session_id FROM user_events; ``` --- ## Performance Optimization ### Indexing Strategy ```sql -- Composite index for window function -- Index columns: PARTITION BY columns first, then ORDER BY columns CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date); -- For running totals across all data CREATE INDEX idx_orders_date_amount ON orders (order_date, order_amount); ``` ### Query Optimization Tips ```yaml optimization_tips: - tip: "Add indexes on PARTITION BY + ORDER BY columns" reason: "Reduces sort operations" - tip: "Use CTEs to compute window once, reference multiple times" reason: "Avoids redundant calculations" - tip: "Limit rows before applying window functions when possible" reason: "Smaller dataset = faster windows" - tip: "Be explicit about frame clause" reason: "Prevents unexpected defaults" - tip: "Use QUALIFY when supported" reason: "Cleaner and often faster than subquery" ``` ### CTE Pattern for Clarity ```sql -- Use CTEs for complex window queries WITH -- Step 1: Calculate raw metrics daily_metrics AS ( SELECT date, SUM(amount) AS daily_total FROM transactions GROUP BY date ), -- Step 2: Add window calculations with_windows AS ( SELECT date, daily_total, SUM(daily_total) OVER (ORDER BY date) AS running_total, AVG(daily_total) OVER ( ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS rolling_7d_avg, LAG(daily_total, 7) OVER (ORDER BY date) AS same_day_last_week FROM daily_metrics ) -- Step 3: Final calculations SELECT date, daily_total, running_total, rolling_7d_avg, daily_total - same_day_last_week AS wow_change FROM with_windows WHERE date >= CURRENT_DATE - INTERVAL '30 days'; ``` --- ## Лучшие практики 1. **Явно указывай frame clause** — дефолтное поведение может быть неожиданным 2. **Используй CTE для сложных запросов** — улучшает читаемость и maintainability 3. **Создавай composite индексы** — на PARTITION BY + ORDER BY колонки 4. **Проверяй NULL handling** — LAG/LEAD возвращают NULL для граничных строк 5. **Тестируй edge cases** — первая/последняя строка, одна строка в партиции 6. **Используй QUALIFY где поддерживается** — чище чем subquery 7. **Учитывай особенности СУБД** — синтаксис и функции различаются 8. **Оптимизируй сначала WHERE** — фильтруй до применения window functions