# Platform & Infrastructure Improvement Plan ## B2B Analytics SaaS — Series B, 50 Engineers --- ## 1. Shared Capabilities Plan ### 1.1 Problem Statement Multiple feature teams are independently rebuilding export functionality, filtering logic, and permission checks. This results in: - **Duplicated effort**: Engineering hours wasted re-implementing solved problems - **Inconsistent behavior**: Users experience different filtering semantics, export formats, and permission enforcement across features - **Maintenance burden**: Bug fixes and improvements must be applied in multiple places - **Security risk**: Permission checks implemented ad hoc increase the surface area for authorization bugs ### 1.2 Shared Capability Inventory | Capability | Current State | Estimated Duplication | |---|---|---| | Export (CSV, Excel, PDF) | Rebuilt per feature team | 4-6 implementations | | Filtering & query building | Custom per feature | 5+ implementations | | Permission / authorization checks | Inline per endpoint | Scattered across codebase | | Pagination | Mixed cursor/offset | 3-4 implementations | | Audit logging | Partial coverage | Inconsistent | | Rate limiting | Ad hoc or missing | Varies | ### 1.3 Recommended Shared Services Architecture #### A. Export Service - **Scope**: Centralized async export pipeline supporting CSV, Excel, PDF, and scheduled exports - **Design**: - Accept a standardized export request (data source, filters, format, recipient) - Use a job queue (e.g., Redis-backed Sidekiq, or a dedicated task queue) for async processing - Store generated files in object storage (S3/GCS) with signed URLs - Notify users via webhook/email on completion - **API Contract**: ``` POST /api/v1/exports { "source": "dashboard_metrics", "filters": { ... }, "format": "csv", "notify": "email" } ``` - **Benefits**: Uniform format support, centralized size limits, progress tracking, retry logic #### B. Filtering & Query Engine - **Scope**: Shared filter definition, parsing, validation, and SQL generation - **Design**: - Define a filter DSL (JSON-based) that all feature teams use - Central library validates filter syntax, prevents injection, applies tenant scoping - Generates parameterized SQL or query builder calls - Supports common operations: equality, range, contains, in-list, date ranges, null checks - **Filter Schema Example**: ```json { "filters": [ { "field": "created_at", "op": "gte", "value": "2026-01-01" }, { "field": "status", "op": "in", "value": ["active", "trial"] } ], "sort": { "field": "created_at", "dir": "desc" } } ``` - **Benefits**: Consistent UX, single place for query optimization, SQL injection prevention #### C. Authorization Service - **Scope**: Centralized permission evaluation replacing inline checks - **Design**: - Implement a policy engine (consider OPA/Rego, Casbin, or custom RBAC/ABAC service) - Define resources, actions, and roles in a central policy store - Expose a lightweight SDK that feature teams call: `authz.check(user, action, resource)` - Cache permission evaluations with short TTLs for hot paths - Emit audit events for every authorization decision - **Migration Path**: 1. Catalog all existing permission checks across the codebase 2. Map them to a unified role/permission model 3. Implement the central service with a compatibility shim 4. Migrate teams incrementally (one feature area per sprint) 5. Remove legacy inline checks after validation - **Benefits**: Single source of truth for access control, auditable, easier compliance (SOC 2, etc.) 
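To make the proposed SDK call shape concrete, below is a minimal sketch of what `authz.check(user, action, resource)` with short-TTL decision caching and audit emission could look like. This is illustrative only: the class and function names, the fail-closed default, and the toy in-memory policy are assumptions, not an existing API; a real deployment would delegate evaluation to the chosen policy engine (OPA, Casbin, or a custom service).

```python
# Hypothetical sketch of the proposed authz SDK call shape -- not an existing library.
# Shows the three ideas from the design above: a single check(user, action, resource)
# entry point, short-TTL caching of decisions, and an audit event per decision.
import time
from typing import Callable, Dict, Tuple

Decision = Tuple[bool, float]  # (allowed, cache expiry in monotonic seconds)

class AuthzClient:
    def __init__(
        self,
        evaluate: Callable[[str, str, str], bool],   # delegate to OPA/Casbin/central service
        audit: Callable[[str, str, str, bool], None],
        cache_ttl_seconds: float = 30.0,             # short TTL so revocations propagate quickly
    ) -> None:
        self._evaluate = evaluate
        self._audit = audit
        self._ttl = cache_ttl_seconds
        self._cache: Dict[Tuple[str, str, str], Decision] = {}

    def check(self, user_id: str, action: str, resource: str) -> bool:
        """Return True if the user may perform `action` on `resource`."""
        key = (user_id, action, resource)
        cached = self._cache.get(key)
        now = time.monotonic()
        if cached and cached[1] > now:
            return cached[0]
        try:
            allowed = self._evaluate(user_id, action, resource)
            self._cache[key] = (allowed, now + self._ttl)
        except Exception:
            allowed = False  # fail closed if the policy service is unreachable; don't cache
        self._audit(user_id, action, resource, allowed)
        return allowed

# Example wiring with a toy in-memory policy (a real SDK would call the policy engine):
policies = {("analyst", "export", "dashboard_metrics")}
authz = AuthzClient(
    evaluate=lambda u, a, r: (u, a, r) in policies,
    audit=lambda u, a, r, ok: print(f"authz user={u} action={a} resource={r} allowed={ok}"),
)
assert authz.check("analyst", "export", "dashboard_metrics") is True
```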
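Similarly, for the filter DSL described in 1.3.B, the sketch below shows one way the shared library could compile the JSON filter schema into a parameterized SQL fragment with tenant scoping and a field/operator whitelist. The function name, operator templates, and placeholder style are assumptions; only the operator names (`gte`, `in`) and the example payload come from the schema above, and sort/pagination handling is omitted for brevity.

```python
# Hypothetical sketch of filter-DSL-to-SQL compilation in the shared filter library.
from typing import Any, Dict, List, Tuple

# Whitelisted operators -> SQL templates; values are always bound as parameters,
# never interpolated into the SQL string, which is what prevents injection.
_OPS = {
    "eq": "{field} = %s",
    "gte": "{field} >= %s",
    "lte": "{field} <= %s",
    "contains": "{field} ILIKE %s",
    "in": "{field} = ANY(%s)",
    "is_null": "{field} IS NULL",
}

def compile_filters(spec: Dict[str, Any], allowed_fields: set, tenant_id: str) -> Tuple[str, List[Any]]:
    """Return (where_clause, params); tenant scoping is always applied first."""
    clauses, params = ["tenant_id = %s"], [tenant_id]
    for f in spec.get("filters", []):
        field, op = f["field"], f["op"]
        if field not in allowed_fields:      # per-data-source field whitelist
            raise ValueError(f"unknown field: {field}")
        if op not in _OPS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(_OPS[op].format(field=field))
        if op != "is_null":
            value = f["value"]
            params.append(f"%{value}%" if op == "contains" else value)
    return " AND ".join(clauses), params

# Using the filter example from 1.3.B:
spec = {
    "filters": [
        {"field": "created_at", "op": "gte", "value": "2026-01-01"},
        {"field": "status", "op": "in", "value": ["active", "trial"]},
    ]
}
where, params = compile_filters(spec, {"created_at", "status"}, tenant_id="t_123")
# where  -> "tenant_id = %s AND created_at >= %s AND status = ANY(%s)"
# params -> ["t_123", "2026-01-01", ["active", "trial"]]
```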
### 1.4 Platform Team Structure

- **Recommendation**: Form a dedicated Platform Engineering team (3-5 engineers initially)
- **Charter**: Own shared capabilities, define APIs, provide SDKs, maintain SLOs for platform services
- **Operating Model**:
  - Platform team builds and maintains core services
  - Feature teams consume via well-documented SDKs and APIs
  - Platform team holds office hours and reviews integration PRs
  - Shared capabilities are versioned; breaking changes follow a deprecation policy

### 1.5 Migration Strategy

| Phase | Duration | Focus |
|---|---|---|
| Phase 1 | Weeks 1-4 | Catalog existing implementations, define unified APIs |
| Phase 2 | Weeks 5-10 | Build Export Service and Filter Engine v1 |
| Phase 3 | Weeks 8-14 | Build Authorization Service v1 (overlapping) |
| Phase 4 | Weeks 12-20 | Migrate feature teams (1-2 teams per sprint) |
| Phase 5 | Weeks 20-24 | Deprecate and remove legacy implementations |

---

## 2. Postgres Scaling Analysis

### 2.1 Current State Assessment

| Metric | Current | Concern Level |
|---|---|---|
| Database size | 500 GB | Moderate — approaching practical limits for a single node |
| Query latency | Increasing | High — user-facing impact |
| Expected growth | 5x in 6 months | Critical — 2.5 TB without intervention |
| Connection count | Likely growing with team/feature count | Moderate |
| Write volume | Unknown but likely growing with analytics ingestion | High |

### 2.2 Immediate Optimizations (Weeks 1-4)

These require no architectural changes and can deliver quick wins:

#### A. Query Performance Audit
- Run `pg_stat_statements` to identify the top 20 slowest and most frequent queries
- Use `EXPLAIN ANALYZE` on each; look for sequential scans, missing indexes, and poor join plans
- Add targeted indexes (partial, covering, and expression indexes as needed)
- Identify and fix N+1 query patterns in application code

#### B. Connection Pooling
- Deploy PgBouncer in transaction mode if not already present
- Target: reduce active connections from potentially hundreds to a managed pool (50-100 server connections)
- Pooling alone often improves p99 latency significantly by cutting connection churn and per-connection memory overhead

#### C. Configuration Tuning
- Review and optimize: `shared_buffers` (25% of RAM), `effective_cache_size` (75% of RAM), `work_mem`, `maintenance_work_mem`
- Enable `pg_stat_monitor` for detailed query analytics
- Set `random_page_cost` appropriately for SSD storage (1.1-1.5)

#### D. Table Maintenance
- Identify bloated tables with `pgstattuple`; run targeted `VACUUM FULL` during maintenance windows
- Check for unused indexes consuming write overhead
- Flag the largest tables (likely time-series analytics data) as candidates for declarative range partitioning, an architectural change covered in the medium-term plan (2.3)

### 2.3 Medium-Term Scaling (Weeks 4-12)

#### A. Read Replicas
- Deploy 1-2 read replicas for analytics/reporting queries
- Route read traffic at the application level (or use a proxy like PgPool-II)
- Target: offload 60-80% of read traffic from the primary
- Acceptable replication lag: <1 second for most analytics use cases

#### B. Table Partitioning
- Partition the largest tables (analytics events, audit logs, time-series metrics) by month or week
- Benefits: faster queries with partition pruning, easier data lifecycle management, parallel query execution
- Implementation:
```sql
CREATE TABLE events (
    id bigint GENERATED ALWAYS AS IDENTITY,
    tenant_id uuid NOT NULL,
    created_at timestamptz NOT NULL,
    ...
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_q1 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
```

#### C. Data Archival Strategy
- Define retention policies: hot data (0-90 days in the primary), warm data (90-365 days in a separate archive database or cheaper storage), cold data (>1 year in object storage/Parquet)
- Implement automated archival jobs that move old partitions
- Target: hold the primary near its current size despite 5x data growth (the capacity projection in 2.5 lands at roughly 800 GB by month 6, versus ~2.5 TB without intervention)

#### D. Caching Layer
- Add Redis/Memcached for frequently-read, slowly-changing data:
  - Permission/role lookups (TTL: 60s)
  - Dashboard metadata and configuration (TTL: 300s)
  - Aggregated analytics results (TTL: varies by freshness requirement)
- Target: 40-60% cache hit rate for read queries, reducing DB load proportionally

### 2.4 Long-Term Architecture (Months 3-6)

#### A. Analytical Workload Separation (OLAP Split)
- Move heavy analytical queries to a dedicated analytical store
- Options:
  - **ClickHouse**: Excellent for time-series analytics, columnar storage, fast aggregations
  - **TimescaleDB**: If staying in the Postgres ecosystem is important
  - **BigQuery/Redshift**: If preferring managed services
- Stream data from Postgres via CDC (Debezium/Kafka Connect) to the analytical store
- Feature teams query the analytical store for dashboards and reports; Postgres handles transactional workloads

#### B. Sharding Evaluation
- If single-tenant queries dominate, consider tenant-based sharding using Citus or application-level sharding
- For a B2B SaaS with analytics, tenant-based sharding is natural since queries rarely cross tenant boundaries
- **Citus** (a Postgres extension) can shard by `tenant_id` with minimal application changes
- Decision point: pursue sharding only if the OLAP split, read replicas, and partitioning together are insufficient

#### C. Connection Architecture
- Evaluate a service mesh or sidecar proxy pattern for managing database connections
- Consider Postgres logical replication for zero-downtime migrations

### 2.5 Projected Capacity Timeline

| Milestone | Database Size | Strategy | Target Latency (p95) |
|---|---|---|---|
| Today | 500 GB | Single node | Degrading |
| Month 1 | 550 GB | Optimized queries + pooling | <200ms |
| Month 3 | 700 GB | Read replicas + partitioning + caching | <150ms |
| Month 6 | 800 GB primary + archive | OLAP split + archival | <100ms transactional, <2s analytical |
| Month 6 (without action) | ~2.5 TB | No intervention (status quo) | >1s, outages likely |

---

## 3. Reliability SLOs

### 3.1 SLO Framework

We recommend defining SLOs across four dimensions: availability, latency, correctness, and freshness.
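To make the availability dimension concrete before the targets in 3.2, the sketch below works through the arithmetic behind a 99.9% availability SLO and its error budget (the roughly 43 minutes per rolling 30-day window referenced in 3.3). The class, field names, and sample request counts are illustrative assumptions; only the targets come from this plan.

```python
# Illustrative arithmetic for an availability SLI, error budget, and budget remaining.
from dataclasses import dataclass

@dataclass
class SloWindow:
    total_requests: int        # all requests in the rolling 30-day window
    bad_requests: int          # 5xx responses (4xx are excluded per the Tier 1 SLI)
    slo_target: float = 0.999  # 99.9% availability

    @property
    def sli(self) -> float:
        """Measured availability: good requests / total requests."""
        return 1.0 - self.bad_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Allowed fraction of bad requests, e.g. 0.1% for a 99.9% SLO."""
        return 1.0 - self.slo_target

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative means overspent)."""
        burned = self.bad_requests / self.total_requests
        return 1.0 - burned / self.error_budget

# Hypothetical traffic numbers for illustration only:
window = SloWindow(total_requests=50_000_000, bad_requests=20_000)
print(f"SLI = {window.sli:.2%}")                            # 99.96%
print(f"budget remaining = {window.budget_remaining:.0%}")  # 60% -> "healthy" per 3.3

# Time-based view of the same budget:
minutes_per_30_days = 30 * 24 * 60            # 43,200 minutes
budget_minutes = minutes_per_30_days * 0.001  # ~43.2 minutes of full unavailability
```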
### 3.2 Proposed SLOs

#### Tier 1 — User-Facing API (Dashboard, Core Analytics)

| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.9% (8.7 hours downtime/year) | Successful responses / total requests (exclude 4xx) |
| Latency (p50) | < 100ms | Server-side request duration |
| Latency (p95) | < 500ms | Server-side request duration |
| Latency (p99) | < 1,500ms | Server-side request duration |
| Error rate | < 0.1% | 5xx responses / total responses |

#### Tier 2 — Export & Batch Processing

| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.5% | Job processing success rate |
| Completion time (p95) | < 5 minutes for exports under 100K rows | Job start to file available |
| Completion time (p95) | < 30 minutes for exports under 1M rows | Job start to file available |
| Data correctness | 100% | Export matches query results at time of request |

#### Tier 3 — Internal Platform Services (AuthZ, Filtering)

| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.95% | Successful responses / total SDK and API calls (target exceeds Tier 1 because these services are Tier 1 dependencies) |
| Latency (p99) | < 50ms | Authorization check duration |
| Correctness | 100% | No false-positive authorizations ever |
| Cache hit rate | > 80% | For permission lookups |

#### Tier 4 — Database & Infrastructure

| SLI | SLO Target | Measurement |
|---|---|---|
| Database availability | 99.95% | Connection success rate |
| Replication lag | < 5 seconds (p99) | Replica lag metric |
| Query latency (p95) | < 200ms for transactional queries | pg_stat_statements |
| Backup RPO | < 1 hour | Time since last successful backup |
| Backup RTO | < 4 hours | Time to restore from backup |

### 3.3 Error Budget Policy

- **Budget**: With a 99.9% availability SLO, the error budget is 0.1% per rolling 30-day window (~43 minutes/month)
- **When budget is healthy (>50% remaining)**: Normal feature velocity, deploy at will
- **When budget is stressed (10-50% remaining)**: Require additional review for risky deploys, increase monitoring
- **When budget is nearly exhausted (<10% remaining)**: Freeze non-critical deploys; all engineering effort focuses on reliability until the budget recovers
- **Review cadence**: Weekly SLO review in engineering standup, monthly SLO review with leadership

### 3.4 Observability Requirements

To measure these SLOs, implement:

1. **Metrics**: Prometheus/Datadog for request latency histograms, error rates, saturation
2. **Logging**: Structured JSON logs with request IDs, tenant IDs, and latency breakdowns
3. **Tracing**: Distributed tracing (Jaeger/Datadog APM) across all services, especially for the shared capability calls
4. **Dashboards**: One SLO dashboard per tier, visible to all engineers
5. **Alerting**:
   - Page (PagerDuty): >1% error rate sustained for 5 minutes, database primary down
   - Warn (Slack): error-budget burn rate above 1x (on pace to exhaust the budget within the window), replication lag >10s, export queue depth >100
6. **SLO Tracking Tool**: Consider Nobl9, Sloth, or a custom Prometheus recording rules setup

---

## 4. Execution Roadmap

### 4.1 Overview

The plan spans 6 months, divided into three phases. Each phase delivers independent value while building toward the full vision.
```
Month 1-2: STABILIZE — Quick wins, stop the bleeding
Month 3-4: PLATFORM — Build shared capabilities, scale reads
Month 5-6: SCALE — Analytical workload separation, full migration
```

### 4.2 Phase 1: Stabilize (Weeks 1-8)

**Goal**: Reduce immediate pain, establish foundations

| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 1-2 | DB Performance | Query audit report; top 10 queries optimized | 1 Senior SRE + 1 Backend | pg_stat_statements access |
| 1-2 | Observability | SLO dashboards for Tier 1 and Tier 4 | 1 SRE | Metrics pipeline |
| 1-3 | Connection Pooling | PgBouncer deployed and configured | 1 SRE | Staging environment |
| 2-4 | Shared Capabilities | Catalog of all existing export/filter/authz implementations | Platform lead + feature team leads | Codebase access, interviews |
| 3-4 | DB Performance | Missing indexes added; bloated tables vacuumed | 1 SRE | Maintenance window |
| 3-6 | Read Replicas | 2 read replicas provisioned; read routing implemented | 1 SRE + 1 Backend | Infrastructure provisioning |
| 4-8 | Platform Team | Platform team formed (3 engineers); charter defined | Engineering leadership | Hiring/reallocation |
| 5-8 | API Design | Shared Export, Filter, and AuthZ API specs finalized (RFC process) | Platform team | Stakeholder review |

**Phase 1 Success Criteria**:
- p95 API latency reduced by 30%
- SLO dashboards operational
- Platform team staffed and chartered
- API specs approved by feature team leads

### 4.3 Phase 2: Platform Build (Weeks 9-16)

**Goal**: Deliver shared capabilities v1, partition database, begin migrations

| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 9-12 | Export Service | Export Service v1 (CSV, Excel) with async processing | Platform team (2 eng) | Object storage, job queue |
| 9-12 | Filter Engine | Shared filter library v1 with SDK | Platform team (1 eng) | API spec |
| 9-14 | AuthZ Service | Authorization service v1 with policy engine | Platform team (2 eng) + 1 Security | Policy model |
| 10-12 | DB Partitioning | Top 3 largest tables partitioned by date range | 1 SRE + 1 Backend | Maintenance window, testing |
| 11-13 | Caching | Redis caching layer for permission lookups and dashboard metadata | 1 Backend | Redis infrastructure |
| 12-14 | Migration: Team 1 | First feature team migrated to shared Export + Filter | Platform team + Feature team 1 | Export Service v1, Filter v1 |
| 14-16 | Migration: Team 2-3 | Two more feature teams migrated | Platform team support | v1 services stable |

**Phase 2 Success Criteria**:
- Export Service handling 100% of Team 1's exports
- Filter Engine SDK adopted by 3 teams
- Authorization service evaluating permissions for at least 1 feature area
- Database primary tracking the capacity projection in 2.5 (~700 GB at month 3) with partitioning active
- p95 latency < 200ms

### 4.4 Phase 3: Scale (Weeks 17-24)

**Goal**: Handle 5x traffic, complete migrations, operational maturity

| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 17-19 | OLAP Split | Analytical store deployed (ClickHouse or TimescaleDB); CDC pipeline operational | 1 SRE + 1 Data eng | Kafka/CDC infrastructure |
| 17-20 | Migration: All Teams | Remaining feature teams migrated to shared capabilities | Platform team + all feature teams | Stable v1 services |
| 18-20 | Dashboard Queries | Analytical dashboard queries routed to OLAP store | Backend + Platform | OLAP store populated |
| 19-22 | Data Archival | Automated archival pipeline moving cold data to object storage | 1 SRE | Archival policy approved |
| 20-22 | Load Testing | Comprehensive load test at 5x current traffic | SRE team | Staging environment at scale |
| 21-23 | Legacy Cleanup | Remove deprecated export/filter/authz implementations | Feature teams | All migrations complete |
| 22-24 | Operational Runbooks | Runbooks for all platform services; on-call rotation established | Platform + SRE | Service stability |
| 24 | SLO Review | Formal SLO review; adjust targets based on 3 months of data | Engineering leadership | SLO data |

**Phase 3 Success Criteria**:
- System handles 5x current traffic in load tests with SLOs met
- All feature teams consuming shared capabilities (zero legacy export/filter/authz code)
- OLAP store handling 100% of analytical queries
- Database primary stable at roughly 800 GB with archival active, in line with the 2.5 capacity projection
- Error budget healthy across all tiers
- On-call rotation staffed and runbooks complete

### 4.5 Staffing Summary

| Role | Phase 1 | Phase 2 | Phase 3 | Notes |
|---|---|---|---|---|
| Platform Engineers | 0 → 3 | 3-4 | 3-4 | Core platform team |
| SRE | 2 | 2 | 2 | DB scaling, infrastructure |
| Backend (borrowed from feature teams) | 1-2 | 1 | 0 | Migration support |
| Data Engineer | 0 | 0 | 1 | OLAP pipeline |
| **Total dedicated** | **3-5** | **6-7** | **6-7** | ~12-14% of 50 engineers |

### 4.6 Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database outage before scaling completes | Medium | Critical | Prioritize read replicas and connection pooling in Phase 1 |
| Feature teams resist adopting shared capabilities | Medium | High | Executive sponsorship; demonstrate time savings; make migration easy with SDKs |
| Platform team understaffed due to competing priorities | High | High | Dedicated headcount commitment from leadership; protect from feature work |
| Shared service becomes bottleneck/single point of failure | Medium | High | Design for high availability from day 1; circuit breakers; graceful degradation |
| OLAP migration introduces data inconsistencies | Medium | Medium | Dual-write validation period; automated reconciliation checks |
| 5x traffic arrives sooner than expected | Low | Critical | Phase 1 quick wins provide 2-3x headroom; can accelerate Phase 2 |

### 4.7 Key Decision Points

1. **Week 4**: Finalize analytical store choice (ClickHouse vs. TimescaleDB vs. managed service)
2. **Week 8**: Decide on authorization model (RBAC vs. ABAC vs. hybrid) based on catalog findings
3. **Week 12**: Evaluate whether Citus sharding is needed based on partitioning and read-replica results (revisit once the OLAP split is live)
4. **Week 16**: Go/no-go on Phase 3 timeline based on Phase 2 migration velocity
5. **Week 20**: Assess whether SLO targets need adjustment based on real-world data

---

## 5. Summary of Expected Outcomes

| Metric | Today | After 6 Months |
|---|---|---|
| Duplicate export implementations | 4-6 | 1 (shared service) |
| Duplicate filter implementations | 5+ | 1 (shared library) |
| Permission check consistency | Ad hoc, inconsistent | Centralized, auditable |
| Database size (primary) | 500 GB, growing | ~800 GB with archival (vs. ~2.5 TB unmanaged) |
| API latency (p95) | Degrading (>500ms) | <200ms transactional |
| Traffic capacity | ~1x (at limit) | 5-10x headroom |
| Availability SLO | Unmeasured | 99.9% measured and tracked |
| Engineering time on redundant infra work | ~15-20% | <5% |
| Time to add export to new feature | 2-3 weeks | 1-2 days (SDK integration) |

---

*This plan should be reviewed with engineering leadership, SRE, and feature team leads before execution begins.
Priorities may shift based on the query audit findings in Weeks 1-2 and the capability catalog in Weeks 2-4.*