# Platform & Infrastructure Improvement Pack

**Company:** B2B Analytics SaaS (Series B, 50 engineers)
**Date:** 2026-03-17
**Decision statement:** We will extract export, filtering, and permissions into shared platform services, define a Postgres scaling plan with lead-time-aware triggers, and commit to reliability SLOs -- all sequenced by blast radius so the highest-leverage work ships first, enabling the enterprise push without a firefighting spiral.

---

## 1) Context Snapshot

- **System(s) in scope:** Core B2B analytics SaaS platform -- all product services, the shared Postgres database, and the internal tooling layer consumed by feature teams.
- **Users/customers:** Enterprise and mid-market analytics buyers; internal consumers are ~8-10 feature teams (50 engineers total).
- **Primary pains (1-3):**
  1. **Developer velocity** -- every feature team re-implements export, filtering, and permission checks, creating duplicated effort and inconsistent behavior.
  2. **Database scaling** -- Postgres at 500 GB with increasing query latency; 5x traffic growth expected in 6 months from the enterprise push.
  3. **Reliability risk** -- no formal SLOs; enterprise customers will demand contractual uptime and performance guarantees.
- **Time horizon / deadline:** 6 months to enterprise launch readiness. Postgres scaling work must begin immediately given lead times.
- **Stakeholders / decision-maker(s):** VP Engineering (decision-maker), Platform/Infra lead (DRI for shared services), Product Engineering leads (consumers), SRE/on-call rotation (reliability ownership).
- **Constraints (security/compliance, staffing, risk tolerance):**
  - Series B staffing: no dedicated platform team yet; will need to carve out 4-6 engineers from feature teams or hire.
  - Enterprise push implies SOC 2 / data residency requirements are imminent.
  - Risk tolerance: moderate -- can tolerate planned migrations but not extended outages or data loss.
- **Assumptions (explicit):**
  - A1: Current Postgres instance is a single primary with read replicas (no sharding today).
  - A2: Feature teams number 8-10, each with 4-6 engineers; at least 4 teams have built their own export, filtering, or permissions logic.
  - A3: No formal SLOs exist today; monitoring is basic (uptime pings, some application metrics).
  - A4: The enterprise push will bring customers with contractual SLA requirements (99.9%+ availability).
  - A5: Current query latency degradation is primarily from large analytical queries competing with the transactional workload on the same Postgres instance.
- **Success definition (measures):**
  - Export, filtering, and permissions available as platform services consumed by >= 3 teams within 4 months.
  - Postgres scaling plan executed with headroom for 5x growth before enterprise launch.
  - Published SLOs for top 5 user journeys with measurement infrastructure in place.
  - Zero P0 incidents caused by DB saturation or permission inconsistencies during enterprise onboarding.
- **Non-goals / out of scope:**
  - Rewriting the entire application architecture or migrating off Postgres entirely.
  - Product/market positioning of the analytics platform (use `platform-strategy`).
  - Broader technical roadmap sequencing beyond infra (use `technical-roadmaps`).
  - Legacy code cleanup unrelated to shared capabilities (use `managing-tech-debt`).
  - Engineering culture or process changes (use `engineering-culture`).
---

## 2) Shared Capabilities Inventory + Platformization Plan

### Shared Capabilities Inventory

| Capability | Current duplication (where/how) | Consumer teams/services | Proposed platform contract (API/schema/SDK) | Migration approach | Expected impact | Risks |
|---|---|---:|---|---|---|---|
| **Data Export Service** | 4+ teams each built CSV/Excel/PDF export with own queuing, formatting, progress tracking. Different timeout handling, file size limits, and error behavior across teams. | 5 | REST API: `POST /platform/exports` (accepts query definition, format, delivery method). Async job with webhook/polling status. SDK wrapper for common languages. Returns signed download URL. | Phase 1: New exports use platform service. Phase 2: Migrate existing exports team-by-team with adapter shim (old endpoints proxy to new service). Phase 3: Deprecate team-specific implementations over 2 sprints per team. | Eliminates ~3 weeks/quarter of duplicated export work across teams. Consistent UX (progress bars, retry, size limits). Single place to enforce export audit logging for compliance. | Migration friction if teams have custom export formats. Must support current file-size limits during transition. |
| **Filtering & Query Engine** | 4+ teams built bespoke filtering UIs and query builders. Different syntax, operators, and performance characteristics. Some teams hit Postgres directly; others use materialized views. | 6 | Internal SDK/library: `FilterEngine.build(schema, filters) -> SQL/query`. Shared filter grammar (field, operator, value, combinator). Server-side validation and query plan analysis (reject queries exceeding cost threshold). | Phase 1: Ship SDK as internal package; new features adopt it. Phase 2: Teams wrap existing filters with adapter that delegates to SDK. Phase 3: Remove bespoke query builders over 3-month window. | Consistent filter behavior across product. Single optimization point for query performance. Blocks dangerous queries before they hit Postgres. | Filter grammar must be expressive enough for all current use cases. Performance regression risk if SDK adds overhead; mitigate with benchmarking. |
| **Permissions Service** | 3+ teams implemented role checks, feature flags, and entitlement gates independently. Inconsistent enforcement (some check at API layer, some at DB layer, some at UI only). | 7 | gRPC service: `PermissionsService.Check(subject, action, resource) -> {allowed, reason}`. Policy-as-code (OPA/Cedar). SDK with middleware for common frameworks. Caching layer (local + distributed) with TTL-based invalidation. | Phase 1: Deploy permissions service alongside existing checks (shadow mode -- log discrepancies, don't enforce). Phase 2: Flip enforcement to platform service per team/endpoint. Phase 3: Remove inline permission logic. | Consistent access control (critical for enterprise/SOC 2). Single audit log for all permission decisions. Eliminates ~2 weeks/quarter of duplicated authz work. | Shadow mode must run long enough to catch edge cases. Latency budget: permission checks must add < 5 ms p99. Cache invalidation bugs could cause access control failures. |
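To make the export contract concrete, here is a minimal client sketch of the proposed flow (create job, poll, download). Only `POST /platform/exports` and the async-job + signed-URL behavior come from the contract above; the status path `/platform/exports/{id}` and the field names (`export_id`, `status`, `download_url`) are illustrative assumptions.

```python
import time
import requests

PLATFORM_BASE = "https://platform.internal.example.com"  # hypothetical internal host

def run_export(query_definition: dict, fmt: str = "csv", timeout_s: int = 300) -> bytes:
    """Create an export job, poll until it is ready, and download the file."""
    # 1. Create the async export job (endpoint per the proposed contract).
    resp = requests.post(
        f"{PLATFORM_BASE}/platform/exports",
        json={"query": query_definition, "format": fmt, "delivery": "download"},
        timeout=10,
    )
    resp.raise_for_status()
    job = resp.json()

    # 2. Poll job status until completed or failed (polling path is an assumption).
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(
            f"{PLATFORM_BASE}/platform/exports/{job['export_id']}", timeout=10
        ).json()
        if status["status"] == "completed":
            # 3. Fetch the file from the signed download URL returned by the service.
            return requests.get(status["download_url"], timeout=60).content
        if status["status"] == "failed":
            raise RuntimeError(f"Export failed: {status.get('error_type')}")
        time.sleep(2)
    raise TimeoutError("Export did not complete within the polling window")
```

In practice the SDK wrapper would hide this polling loop (or use the webhook delivery mode) so feature teams never hand-roll it.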
### Platformization Decisions

- **What becomes a shared primitive (and why):**
  - **Export** -- 5 consumers, high duplication, compliance requirement for audit trail. Stable contract surface (input: query + format; output: file).
  - **Filtering** -- 6 consumers, highest duplication count, and directly tied to Postgres performance problems (unoptimized queries). Centralizing this is also a scaling lever.
  - **Permissions** -- 7 consumers (every team needs it), enterprise customers require consistent RBAC, and SOC 2 demands a single audit trail. Inconsistent enforcement is a security risk.
- **What remains product-specific (and why):**
  - **Visualization rendering** -- highly product-specific; each analytics view has unique charting/rendering needs. Not enough commonality for a shared primitive yet.
  - **Notification preferences** -- only 2 teams use notifications today and the UX requirements differ significantly. Revisit when a third consumer appears.
  - **Custom report scheduling** -- closely tied to individual product domains; too early to abstract.
- **Ownership model:**
  - Dedicated **Platform Services team** (4-6 engineers, carved from feature teams + 2 new hires). This team owns the shared services, SLOs, and migration support.
  - Feature teams own integration/migration of their code to platform services. Platform team provides pairing support during migration sprints.
- **Versioning + backwards compatibility plan:**
  - Semantic versioning for all platform service APIs and SDKs.
  - **Breaking changes require a 2-sprint deprecation window** with migration guide.
  - Export and Permissions services: versioned API paths (`/v1/`, `/v2/`). Old versions supported for 3 months after new version GA.
  - Filtering SDK: major version bumps require opt-in; minor/patch versions are backward-compatible.
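To ground the Filtering & Query Engine contract, here is a minimal sketch of the shared filter grammar (field, operator, value, combinator) and the cost-threshold rejection. Everything beyond `FilterEngine.build(schema, filters)` and the grammar fields named in the inventory -- class shapes, the operator whitelist, the cost units -- is an illustrative assumption.

```python
from dataclasses import dataclass

# Operator whitelist keeps the grammar closed; anything else is rejected up front.
OPERATORS = {"eq": "=", "neq": "!=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}

@dataclass
class Filter:
    field: str
    operator: str  # one of OPERATORS
    value: object

class QueryTooExpensive(Exception):
    pass

class FilterEngine:
    def __init__(self, schema: dict[str, set[str]], cost_threshold: float = 1_000_000):
        self.schema = schema                  # table -> allowed filterable columns
        self.cost_threshold = cost_threshold  # e.g. planner cost units or row estimate

    def build(self, table: str, filters: list[Filter], combinator: str = "AND"):
        """Translate validated filters into a parameterized WHERE clause."""
        allowed = self.schema[table]
        clauses, params = [], []
        for f in filters:
            if f.field not in allowed or f.operator not in OPERATORS:
                raise ValueError(f"invalid filter: {f}")
            clauses.append(f"{f.field} {OPERATORS[f.operator]} %s")
            params.append(f.value)
        where = f" {combinator} ".join(clauses) or "TRUE"
        return f"SELECT * FROM {table} WHERE {where}", params

    def check_cost(self, estimated_cost: float) -> None:
        """Reject queries whose estimate exceeds the cost threshold."""
        if estimated_cost > self.cost_threshold:
            raise QueryTooExpensive(
                f"estimated cost {estimated_cost} exceeds {self.cost_threshold}"
            )
```

In the real SDK the cost estimate would come from query-plan analysis (per the contract above) rather than being passed in by the caller.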
---

## 3) Quality Attributes Spec (SLOs/SLIs + Privacy/Safety)

### Reliability Targets

- **Availability:** 99.9% measured monthly for all tier-1 user journeys (see SLO table below). This translates to ~43 minutes of allowed downtime per month.
- **Error rate:** < 0.1% 5xx error rate on tier-1 APIs measured over rolling 7-day windows.
- **MTTR (Mean Time to Recover):** < 30 minutes for P0 incidents (complete service unavailability); < 2 hours for P1 (degraded but functional).
- **Error budget policy:** When monthly error budget is < 25% remaining, freeze non-critical deployments and prioritize reliability work until budget resets.

### Performance Targets

- **Dashboard load (primary journey):** p95 < 2 seconds, p99 < 4 seconds end-to-end.
- **API response (CRUD operations):** p95 < 200 ms, p99 < 500 ms.
- **Export jobs:** Initiation < 1 second; completion for datasets < 100 MB within 60 seconds. Larger exports: progress updates every 10 seconds.
- **Permission checks:** p99 < 5 ms (cached), p99 < 50 ms (uncached).
- **Filter query execution:** p95 < 500 ms for standard filters; queries exceeding 5 seconds are killed and the user is prompted to narrow scope.

### Privacy/Safety Requirements

- **Encryption:** TLS 1.2+ in transit; AES-256 at rest for all data stores (Postgres, object storage, caches).
- **Access control:** RBAC enforced through the Permissions Service for all API endpoints. No direct DB access from application code without going through the service layer.
- **Data residency:** Prepare for regional deployment (US, EU) to support enterprise data residency requirements. Architecture must support tenant-level data isolation.
- **Retention:** Define retention policies per data class: operational data (2 years), audit logs (7 years), analytics events (1 year raw, aggregated indefinitely). Automated purge jobs.
- **Audit trail:** All permission checks, data exports, and admin actions logged to an immutable audit store. Required for SOC 2 Type II.

### Operability Requirements

- **Dashboards:** Unified platform health dashboard (Datadog/Grafana) covering: DB metrics, API latency/error rates, export job queue depth, permission service latency, SLO burn rate.
- **Alerts:** PagerDuty integration. Alert on SLO burn rate (fast burn: 10x consumption rate, slow burn: 2x consumption rate). DB-specific alerts on connection count, replication lag, disk usage, query duration.
- **Runbooks:** One runbook per P0 scenario (DB failover, permission service outage, export queue backup, full disk). Runbooks linked from alert definitions.
- **On-call:** Platform team owns a dedicated on-call rotation. Feature teams handle product-specific incidents but escalate to platform on-call for shared service issues.

### Cost Guardrails

- **Top drivers:** Postgres (compute + storage), application compute (Kubernetes), object storage (exports), observability tooling.
- **Monthly budget caps:** Set alerts at 80% and 100% of monthly infrastructure budget. Any single service exceeding 120% of its allocation triggers cost review.
- **Optimization targets:** Reduce per-query cost by 40% through filtering engine optimization and read replica routing. Export storage: auto-expire files after 7 days.

### Proposed SLOs/SLIs

| User journey / API | SLI | SLO target | Measurement method | Owner | Notes |
|---|---|---|---|---|---|
| Dashboard load (primary) | Time from request to interactive render | p95 < 2 s, p99 < 4 s | RUM (Real User Monitoring) + synthetic checks every 60 s | Product Eng + Platform | Tier-1 journey; measured end-to-end including API + rendering |
| API CRUD operations | Server-side latency (request received to response sent) | p95 < 200 ms, p99 < 500 ms | Application metrics (histogram) | Platform team | Excludes network transit; measured at load balancer |
| Data export completion | Time from job creation to download-ready | < 60 s for datasets < 100 MB | Export service metrics (job duration histogram) | Platform team | Larger exports measured separately; SLO applies to 90th percentile of jobs |
| Permission check latency | Latency of `Check()` RPC | p99 < 5 ms (cached), p99 < 50 ms (uncached) | gRPC service metrics | Platform team | Cache hit rate target: > 95% |
| Overall availability | Successful requests / total requests (excluding maintenance) | 99.9% monthly | Load balancer access logs + health checks | SRE / Platform team | 43 min downtime budget per month |
| Filter query execution | Query execution time for standard filter operations | p95 < 500 ms | DB query metrics + application instrumentation | Platform team | Queries exceeding 5 s are killed; tracked separately as "timeout rate" |
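A minimal sketch of the arithmetic behind these targets -- the 99.9% downtime budget, the fast/slow burn-rate multipliers from the alerting policy, and the 25% freeze threshold from the error budget policy. The function names and the assumption that the SLI is request-based are illustrative.

```python
SLO_TARGET = 0.999              # 99.9% monthly availability from the Reliability Targets
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests (or minutes) per month

def downtime_budget_minutes(days_in_month: int = 30) -> float:
    """Allowed downtime per month at the 99.9% target (~43 minutes for a 30-day month)."""
    return days_in_month * 24 * 60 * ERROR_BUDGET

def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed relative to the SLO's allowance.
    1.0 = exactly on budget; 10x triggers the fast-burn alert and 2x the slow-burn
    alert per the Operability alerting policy above."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

def remaining_budget_fraction(failed_so_far: int, expected_monthly_requests: int) -> float:
    """Fraction of the monthly error budget still unspent; < 0.25 triggers the
    deployment freeze described in the error budget policy."""
    budget_in_requests = ERROR_BUDGET * expected_monthly_requests
    return max(0.0, 1.0 - failed_so_far / budget_in_requests)

# Example: 300 failures out of 1,000,000 requests -> burn rate 0.3 (healthy).
assert abs(burn_rate(300, 1_000_000) - 0.3) < 1e-9
print(downtime_budget_minutes())  # ~43.2
```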
---

## 4) Scaling "Doomsday Clock" + Capacity Plan

### Doomsday Clock

| Component/limit | Metric | Current | Trigger threshold | Estimated lead time to mitigate | Mitigation project | Owner |
|---|---|---:|---:|---|---|---|
| **Postgres disk (500 GB)** | Total DB size (GB) | 500 GB | 650 GB (70% of typical managed instance max before perf cliff) | 6-8 weeks | Data archival + partitioning (see below) | Platform lead |
| **Postgres IOPS** | Read/Write IOPS | ~8,000 (est.) | 12,000 (80% of provisioned IOPS on current instance class) | 4-6 weeks | Read replica routing for analytics queries + connection pooler (PgBouncer) | Platform lead |
| **Postgres connections** | Active connections | ~150 (est.) | 300 (75% of max_connections, typically 400 on managed instances) | 2-3 weeks | PgBouncer connection pooling; review connection lifecycle in application code | Platform eng |
| **Postgres query latency** | p95 query duration (ms) | ~800 ms (est., degrading) | 500 ms (target), 1,500 ms (critical) | 4-6 weeks | Separate OLTP/OLAP workloads; read replicas for heavy analytics; query optimization via filtering engine | Platform lead |
| **Postgres replication lag** | Replica lag (seconds) | < 1 s (est.) | 10 s sustained | 2-3 weeks | Investigate write amplification; tune WAL settings; consider logical replication for selective tables | Platform eng |
| **Application compute (K8s)** | CPU/memory utilization across pods | ~55% (est.) | 75% sustained over 1 hour | 1-2 weeks | Horizontal auto-scaling policy; right-size pod resource requests | SRE |
| **Export queue depth** | Pending export jobs | ~20 (est.) | 200 (indicates backlog buildup) | 1-2 weeks | Auto-scale export workers; implement priority queue (enterprise jobs first) | Platform eng |
| **Object storage (exports)** | Total stored export files (GB) | ~50 GB (est.) | 500 GB (cost threshold) | 1 week | Auto-expire exports after 7 days; lazy-generate on re-request | Platform eng |
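The trigger thresholds above are only useful when compared against runway. A minimal sketch of that comparison, under assumed weekly growth rates (the growth numbers below are illustrative, not measured): if the weeks of runway fall inside the mitigation lead time, the project must start now.

```python
from dataclasses import dataclass

@dataclass
class ClockEntry:
    name: str
    current: float            # current metric value from the table above
    trigger: float            # trigger threshold from the table above
    weekly_growth_pct: float  # assumed growth rate (estimate, not from the table)
    lead_time_weeks: float    # mitigation lead time from the table above

    def weeks_to_trigger(self) -> float:
        """Weeks until the trigger threshold is crossed at the assumed growth rate."""
        if self.weekly_growth_pct <= 0 or self.current >= self.trigger:
            return 0.0
        value, weeks = self.current, 0.0
        while value < self.trigger and weeks < 520:
            value *= 1 + self.weekly_growth_pct / 100
            weeks += 1
        return weeks

    def must_start_now(self) -> bool:
        """True when remaining runway is inside the mitigation lead time."""
        return self.weeks_to_trigger() <= self.lead_time_weeks

# Example with assumed growth rates (illustrative numbers only):
entries = [
    ClockEntry("postgres_disk_gb", current=500, trigger=650, weekly_growth_pct=3, lead_time_weeks=8),
    ClockEntry("active_connections", current=150, trigger=300, weekly_growth_pct=5, lead_time_weeks=3),
]
for e in entries:
    runway = round(e.weeks_to_trigger())
    print(e.name, runway, "weeks of runway;", "START NOW" if e.must_start_now() else "watch")
```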
### Capacity Plan

**Top scaling risks (ordered by time-to-breach):**

1. **Postgres disk + query latency (CRITICAL -- breach in ~3 months at current growth):** At 5x traffic growth, the 500 GB database will approach managed instance limits within 3 months. Query latency is already degrading, indicating the problem is immediate.
2. **Postgres IOPS + connections (HIGH -- breach in ~4 months):** 5x traffic means ~5x connection demand and proportional IOPS increase. Connection pooling buys time but doesn't solve the fundamental read/write contention.
3. **Export queue saturation (MEDIUM -- breach in ~5 months):** Enterprise customers will drive heavier export usage; queue must scale horizontally.

**Proposed scaling projects (sequenced by urgency):**

**Project S1: Postgres Workload Separation (Month 1-2)**
- Separate OLTP (transactional) and OLAP (analytical/reporting) workloads.
- Route read-heavy analytics queries to dedicated read replicas (see the routing sketch after this project list).
- Deploy PgBouncer for connection pooling (reduce active connections by ~60%).
- Expected outcome: Buys 6+ months of headroom on connections and IOPS.

**Project S2: Data Archival + Table Partitioning (Month 2-3)**
- Implement time-based partitioning on the largest tables (event logs, audit trails, analytics data).
- Archive data older than 12 months to cold storage (S3 + Athena for ad-hoc queries).
- Target: Reduce active DB size from 500 GB to ~200 GB.
- Expected outcome: Significant improvement in query performance; disk pressure eliminated for 12+ months.

**Project S3: Filtering Engine Query Optimization (Month 2-4)**
- Deploy the shared Filtering SDK with built-in query cost analysis.
- Kill queries exceeding the cost threshold; guide users to narrow filters.
- Add query plan caching for common filter patterns.
- Expected outcome: 40% reduction in average query cost; eliminates runaway queries.

**Project S4: Evaluate Postgres Vertical Upgrade vs. Citus/Read Scaling (Month 3-4)**
- If S1-S3 are insufficient for 5x headroom, evaluate:
  - **Option A:** Vertical upgrade to larger instance class (quick but has a ceiling).
  - **Option B:** Citus extension for horizontal scaling (distributes large tables across nodes).
  - **Option C:** Introduce a dedicated analytical data store (ClickHouse/Redshift) for reporting workloads, keeping Postgres lean for OLTP.
- Decision criteria: cost, migration complexity, operational burden, and headroom provided.
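A minimal sketch of the OLTP/OLAP routing described in Project S1, assuming psycopg2 as the driver and PgBouncer in front of both the primary and the analytics replica. The DSNs, port, and table names are placeholders, not actual configuration.

```python
import psycopg2  # assumes psycopg2 is the driver in use; any driver works the same way

# Both DSNs would normally point at PgBouncer, which pools connections in front of
# the primary and the analytics replica (hosts/ports here are placeholders).
PRIMARY_DSN = "host=pgbouncer-primary port=6432 dbname=app user=app"
ANALYTICS_REPLICA_DSN = "host=pgbouncer-replica port=6432 dbname=app user=app_ro"

def get_connection(workload: str):
    """Route OLTP work to the primary and heavy analytical reads to the replica."""
    if workload == "analytics":
        return psycopg2.connect(ANALYTICS_REPLICA_DSN)
    return psycopg2.connect(PRIMARY_DSN)

# Transactional write path -> primary.
with get_connection("oltp") as conn, conn.cursor() as cur:
    cur.execute("UPDATE accounts SET plan = %s WHERE id = %s", ("enterprise", 42))

# Dashboard/reporting query -> replica, keeping load off the primary.
with get_connection("analytics") as conn, conn.cursor() as cur:
    cur.execute("SELECT account_id, count(*) FROM events GROUP BY account_id")
    rows = cur.fetchall()
```

In practice this routing would live in the data-access layer (or the Filtering SDK), not in feature code, so the replica/primary split is enforced in one place.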
**Feature-freeze / priority policy when triggers fire:**

- **Yellow (trigger threshold reached):** Scaling work becomes P1; no new features that increase DB load. Platform team gets 2 additional engineers from feature teams.
- **Red (critical threshold reached):** Full feature freeze on DB-intensive work. All available engineers support scaling mitigation. Stakeholder communication within 4 hours of red status.
- **Monitoring:** Weekly capacity review meeting (30 min) until all metrics are below 50% of trigger thresholds.

---

## 5) Instrumentation Plan (Observability + Server-Side Analytics)

### Observability Gaps

| Area | Current state | Gap | Proposed instrumentation | Owner | Priority |
|---|---|---|---|---|---|
| **Database metrics** | Basic uptime monitoring | No query-level latency tracking, no connection pool metrics, no replication lag alerts | Postgres exporter (Prometheus) + PgBouncer metrics. Dashboards: query duration histograms, connection utilization, replication lag, table bloat, cache hit ratio. Alerts: p95 query > 500 ms, connections > 300, replication lag > 10 s. | Platform eng | P0 |
| **SLO burn rate** | No SLOs defined | No burn-rate tracking or alerting | Implement SLO tracking (Datadog SLO monitors or Prometheus + sloth). Multi-window burn-rate alerts (fast: 5 min window, slow: 1 hr window). Dashboard showing remaining error budget per SLO. | SRE / Platform | P0 |
| **Platform service health** | N/A (services don't exist yet) | No metrics for new shared services | Each platform service (Export, Filtering, Permissions) ships with: request rate, error rate, latency histograms, queue depth (export), cache hit rate (permissions). Standard RED metrics dashboard per service. | Platform eng | P1 (ship with services) |
| **Distributed tracing** | Partial or absent | Cannot trace a request end-to-end across services | Deploy OpenTelemetry SDK across all services. Trace context propagation through HTTP headers and gRPC metadata. Sample rate: 100% for errors, 10% for success in production. | Platform eng | P1 |
| **Cost monitoring** | Cloud provider billing dashboard only | No per-service or per-feature cost attribution | Tag all infrastructure resources by service/team. Weekly automated cost report. Alert on > 20% week-over-week increase per service. | SRE | P2 |
| **Export job observability** | Basic job success/fail logging | No duration tracking, no queue depth visibility, no per-tenant metrics | Export service emits: job_created, job_started, job_completed, job_failed events with duration, file size, tenant_id. Dashboard: queue depth, completion time histogram, failure rate by type. | Platform eng | P1 |

### Server-Side Analytics Event Contract

- **Canonical identity fields:**
  - `user_id` (UUID) -- authenticated user; always present for logged-in actions.
  - `account_id` (UUID) -- the organization/tenant; always present.
  - `anonymous_id` (UUID) -- generated client-side for pre-auth tracking; merged to `user_id` on login via server-side merge event.
- **Merge rules:** On authentication, emit `identity.merged(anonymous_id, user_id, account_id)`. Analytics pipeline deduplicates and re-attributes pre-auth events to the resolved user.
- **Delivery semantics:**
  - **At-least-once** delivery from application to event bus (Kafka/SQS).
  - **Dedupe strategy:** Every event carries an `event_id` (UUID v7, time-sortable). Consumers deduplicate on `event_id` within a 24-hour window.
  - Events are produced server-side at the point of action completion (not on request receipt).
- **Schema/versioning:**
  - JSON Schema registry (e.g., SchemaStore in a git repo or a schema registry service).
  - Events follow a `noun.verb` naming convention (e.g., `export.completed`, `filter.applied`, `permission.checked`).
  - Schema changes require a PR review; breaking changes produce a new event version (`export.completed.v2`) with a 3-month overlap period.
- **Data QA checks:**
  - **Schema validation:** Events validated against JSON Schema at production time (reject malformed events to dead-letter queue).
  - **Volume anomaly detection:** Alert if any event type volume drops > 50% or increases > 300% compared to 7-day rolling average.
  - **Null-rate checks:** Alert if required fields have null rate > 1%.
  - **Dedupe rate monitoring:** Track duplicate event rate; alert if > 5% (indicates producer retry storms).
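A minimal sketch of the event envelope and the consumer-side dedupe window described above. The envelope field names beyond the contract's identity fields and `event_id` are illustrative, and the in-process dedupe store stands in for the shared store (e.g., Redis or the consumer's own table) a real pipeline would use.

```python
import time
import uuid

def make_event(name: str, properties: dict, user_id: str, account_id: str) -> dict:
    """Build the event envelope described in the contract above."""
    return {
        # The contract calls for a time-sortable UUID v7; uuid4 stands in here because
        # the stdlib has no uuid7 -- a uuid7 library would be used in production.
        "event_id": str(uuid.uuid4()),
        "name": name,                      # noun.verb convention, e.g. "export.completed"
        "occurred_at": time.time(),
        "user_id": user_id,
        "account_id": account_id,
        "properties": properties,
    }

class Deduper:
    """Consumer-side dedupe on event_id within a 24-hour window; at-least-once
    delivery means the same event can arrive more than once."""
    WINDOW_S = 24 * 3600

    def __init__(self):
        self._seen: dict[str, float] = {}  # event_id -> first-seen timestamp

    def is_duplicate(self, event: dict) -> bool:
        now = time.time()
        # Drop entries that have aged out of the window.
        self._seen = {eid: ts for eid, ts in self._seen.items() if now - ts < self.WINDOW_S}
        if event["event_id"] in self._seen:
            return True
        self._seen[event["event_id"]] = now
        return False

evt = make_event("export.completed", {"export_id": "exp_1", "duration_ms": 1200},
                 user_id="u_1", account_id="a_1")
d = Deduper()
assert d.is_duplicate(evt) is False
assert d.is_duplicate(evt) is True   # redelivery is suppressed
```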
### Event Taxonomy (Starter Table)

| Event name | When emitted (server action) | Required properties | Identity fields | Consumers (teams) | Notes |
|---|---|---|---|---|---|
| `dashboard.loaded` | Server completes data fetch for dashboard render | `dashboard_id`, `query_count`, `total_duration_ms`, `data_points_returned` | `user_id`, `account_id` | Product analytics, Performance monitoring | Primary journey; correlate with RUM for full picture |
| `export.requested` | Export job created in queue | `export_id`, `format` (csv/xlsx/pdf), `estimated_rows`, `filter_hash` | `user_id`, `account_id` | Platform team, Product analytics | Track export patterns to optimize common formats |
| `export.completed` | Export file ready for download | `export_id`, `format`, `file_size_bytes`, `duration_ms`, `row_count` | `user_id`, `account_id` | Platform team, Billing (large exports) | Used for SLO measurement |
| `export.failed` | Export job fails after retries exhausted | `export_id`, `error_type`, `error_message`, `retry_count` | `user_id`, `account_id` | Platform team, SRE | Triggers alert if failure rate > 2% |
| `filter.applied` | Filter query executed via Filtering SDK | `filter_hash`, `field_count`, `query_duration_ms`, `rows_scanned`, `rows_returned` | `user_id`, `account_id` | Platform team, Product analytics | Feeds query optimization; identifies expensive patterns |
| `filter.rejected` | Query killed due to cost threshold | `filter_hash`, `estimated_cost`, `threshold`, `rejection_reason` | `user_id`, `account_id` | Platform team, Product (UX improvement) | Track to improve filter UX guidance |
| `permission.checked` | Permission service processes a Check() call | `subject_id`, `action`, `resource_type`, `resource_id`, `result` (allowed/denied), `latency_ms`, `cache_hit` | `user_id`, `account_id` | Security, Compliance/Audit | High-volume; sample at 10% for analytics, 100% for audit log |
| `permission.denied` | Permission check returns denied | `subject_id`, `action`, `resource_type`, `resource_id`, `reason` | `user_id`, `account_id` | Security, Product (UX -- show proper error) | 100% capture; used for security review |
| `identity.merged` | User authenticates, linking anonymous to known | `anonymous_id`, `method` (password/sso/oauth) | `user_id`, `account_id`, `anonymous_id` | Analytics pipeline | Triggers re-attribution of pre-auth events |
| `account.limit_approached` | Tenant usage approaches plan limit | `limit_type`, `current_value`, `limit_value`, `percentage_used` | `account_id` | Billing, Customer Success, Product | Drives upsell and capacity planning |

---

## 6) Discoverability Plan

**Not applicable.** This is a B2B SaaS analytics product, not a content-heavy web property. SEO/discoverability is not a primary concern for the application itself. Marketing site SEO is out of scope for this infrastructure plan.

---

## 7) Execution Roadmap

Prioritized by **blast radius** (how many teams/users are affected if we don't act) crossed with urgency (time-to-breach).

### Roadmap

| Milestone | Scope | Acceptance criteria | Owner | Dependencies | ETA range | Rollout/rollback |
|---|---|---|---|---|---|---|
| **M0: Platform Team Formation** | Hire/reassign 4-6 engineers; establish platform team charter, on-call rotation, and communication channels | Team staffed; charter published; on-call rotation active; Slack channel + weekly sync established | VP Engineering | Budget approval; backfill plan for feature teams | Week 1-2 | N/A |
| **M1: Emergency DB Relief (PgBouncer + Read Replicas)** | Deploy PgBouncer connection pooler; route analytics read queries to dedicated read replica(s) | Active connections reduced by >= 50%; analytics queries running on replica; p95 query latency reduced by >= 30% | Platform lead | M0 (team exists); DBA access to Postgres config | Week 2-4 | Rollback: disable PgBouncer and revert DNS/connection strings to primary. Read replica routing toggled via feature flag. |
| **M2: Observability Foundation** | Deploy Postgres metrics exporter, SLO tracking, distributed tracing (OpenTelemetry), platform health dashboards | All SLO dashboards live; burn-rate alerts firing; DB metrics (connections, IOPS, replication lag, query duration) visible; tracing deployed to >= 3 critical services | SRE / Platform eng | M0; observability tooling access (Datadog/Grafana) | Week 3-5 | Rollback: disable exporters/agents if perf impact; dashboards are additive (no rollback needed). |
| **M3: Permissions Service (Shadow Mode)** | Deploy permissions service; integrate with 2 pilot teams in shadow mode (log-only, no enforcement) | Service deployed; shadow mode processing 100% of permission checks for pilot teams; discrepancy rate tracked on dashboard; latency < 5 ms p99 (cached) | Platform eng | M0; policy language chosen (OPA/Cedar); pilot teams identified | Week 4-8 | Rollback: disable shadow mode integration (feature flag per team). No user impact since shadow mode is non-enforcing. |
| **M4: Data Archival + Table Partitioning** | Partition largest tables by time; archive data > 12 months to cold storage (S3); verify query performance improvement | Active DB size reduced to < 250 GB; archived data queryable via Athena; no data loss verified via row-count reconciliation; p95 query latency improved by >= 40% | Platform lead + DBA | M1 (read replicas for safe migration); M2 (monitoring to verify) | Week 5-9 | Rollback: partitioning is additive; if issues, queries can still access all partitions. Archive has 30-day restore window from S3. |
| **M5: Filtering SDK (v1)** | Ship internal Filtering SDK with query cost analysis; integrate with 2 pilot teams | SDK published as internal package; 2 teams using it in production; queries exceeding cost threshold are rejected; average query cost reduced by >= 25% for pilot teams | Platform eng | M1 (read replicas reduce load); M2 (query metrics to measure improvement) | Week 6-10 | Rollback: teams revert to direct query building (SDK is opt-in via import). Feature flag to disable cost-threshold rejection. |
| **M6: Permissions Service (Enforcement)** | Flip from shadow mode to enforcement for pilot teams; roll out to remaining teams | All teams enforcing via platform permissions service; discrepancy rate < 0.1%; audit log capturing 100% of permission decisions; SOC 2 audit trail requirement met | Platform eng + Security | M3 (shadow mode validated); M2 (monitoring in place) | Week 8-12 | Rollback: per-team feature flag reverts to inline permission checks. Gradual rollout: 1 team per week. |
| **M7: Export Service (v1)** | Deploy shared export service; migrate 2 pilot teams | Export service handling production traffic for 2 teams; async job processing with status tracking; export SLO met (< 60 s for < 100 MB); audit logging active | Platform eng | M2 (observability); M5 (filtering SDK for export query building) | Week 9-14 | Rollback: pilot teams revert to existing export code (old endpoints remain active during migration). Traffic split via feature flag. |
| **M8: Full Rollout + Scaling Evaluation** | All teams on platform services (filtering, permissions, export); evaluate need for S4 (Citus/vertical upgrade/analytical store) | >= 6 teams consuming each platform service; all SLOs met for 2 consecutive weeks; scaling evaluation document published with recommendation | Platform lead | M4-M7 complete; 4 weeks of production data on new architecture | Week 14-20 | Rollback: per-team feature flags for each service. Scaling evaluation informs next phase (no rollback needed). |
| **M9: Enterprise Readiness Certification** | Validate all SLOs met under load test simulating 5x traffic; SOC 2 controls verified; runbooks tested via game day | Load test passes with all SLOs green at 5x current traffic; SOC 2 evidence package complete; 1 game day executed with < 30 min MTTR | VP Engineering + Platform lead + Security | M6, M7, M8 complete; load testing infrastructure | Week 20-24 | N/A (validation milestone). If SLOs fail under load test, trigger S4 scaling project immediately. |

### Sequencing Rationale (Blast Radius Priority)

1. **M0-M1 (Weeks 1-4): Stop the bleeding.** The DB is the single point of failure for all 50 engineers and all customers. PgBouncer + read replicas are the highest-blast-radius, lowest-effort wins.
2. **M2 (Weeks 3-5): See before you act.** Without observability, every subsequent decision is guesswork. This is foundational.
3. **M3/M4 (Weeks 4-9): Permissions + DB scaling in parallel.** Permissions affects all 7 consumer teams (highest blast radius among shared capabilities). DB archival/partitioning addresses the most urgent scaling risk.
4. **M5/M6 (Weeks 6-12): Filtering + Permissions enforcement.** The Filtering SDK directly reduces DB load (scaling lever) and improves developer velocity. Permissions enforcement completes the security/compliance story.
5. **M7 (Weeks 9-14): Export.** Important but lower blast radius than permissions and filtering; fewer teams blocked and no security/compliance urgency.
6. **M8-M9 (Weeks 14-24): Consolidation + certification.** Full rollout, scaling evaluation, and enterprise readiness validation.
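To make the M3 -> M6 transition concrete, here is a minimal shadow-mode sketch: the legacy inline check stays authoritative while the platform decision is compared and logged, and the discrepancy rate it produces is what gates the enforcement flip. The `platform_client.check` call shape is an assumption modeled on the `PermissionsService.Check(subject, action, resource) -> {allowed, reason}` contract in Section 2.

```python
import logging

logger = logging.getLogger("authz.shadow")

def check_permission_shadow(legacy_check, platform_client, subject, action, resource):
    """Shadow-mode wrapper per M3: log-only, no enforcement."""
    legacy_allowed = legacy_check(subject, action, resource)
    try:
        platform_allowed, reason = platform_client.check(subject, action, resource)
    except Exception:
        # The shadow path must never affect user-facing behavior.
        logger.exception("shadow permission check failed", extra={"action": action})
        return legacy_allowed

    if platform_allowed != legacy_allowed:
        # Discrepancy rate feeds the M3 dashboard and gates the M6 enforcement flip.
        logger.warning(
            "authz discrepancy",
            extra={"subject": subject, "action": action, "resource": resource,
                   "legacy": legacy_allowed, "platform": platform_allowed, "reason": reason},
        )
    return legacy_allowed
```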
---

## 8) Risks / Open Questions / Next Steps

### Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| **Platform team staffing delay** -- cannot hire or reassign 4-6 engineers quickly enough | Medium | High (entire roadmap slips) | Begin internal reassignment immediately (2-3 engineers); hire for remaining slots in parallel. Feature teams accept temporary velocity reduction. |
| **Postgres hits critical threshold before M1 completes** -- query latency becomes unacceptable before PgBouncer/replicas are ready | Medium | High (customer-facing outages) | Fast-track M1 to 2-week delivery. Prepare emergency vertical upgrade as backup (can execute in days). Implement query timeout (kill queries > 10 s) as immediate stopgap. |
| **Migration friction underestimated** -- teams resist or are slower than expected to adopt platform services | Medium | Medium (roadmap extends 4-8 weeks) | Dedicated migration support from platform team (pairing). Executive mandate from VP Eng. Track migration per-team on weekly dashboard. |
| **Permission service shadow mode reveals deep inconsistencies** -- existing permission implementations disagree significantly | Medium | Medium (delays enforcement) | Extend shadow mode by 2-4 weeks. Triage discrepancies by severity: fix critical ones immediately, defer cosmetic ones. Document "golden" behavior as the authoritative source. |
| **5x traffic growth arrives faster than 6 months** -- enterprise deals close early or marketing spike occurs | Low | High (DB/infra crisis) | M1 (DB relief) must complete in 4 weeks regardless. Maintain "emergency scaling playbook" (vertical upgrade + aggressive caching) as a break-glass option. |
| **SOC 2 audit requirements are broader than anticipated** -- additional controls needed beyond permissions audit trail | Medium | Medium (scope creep) | Engage security/compliance consultant in Week 1-2 to enumerate full control requirements. Build controls inventory in parallel with M3. |
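One mitigation above names a concrete stopgap (kill queries running longer than 10 s). A minimal sketch of how that cap can be applied per connection, assuming psycopg2; the DSN is a placeholder, and `statement_timeout` is standard Postgres (milliseconds via libpq options, or an interval via `SET`).

```python
import psycopg2

# Stopgap from the mitigation column above: cap any single statement at 10 s so a
# runaway analytical query cannot saturate the primary while M1 is in flight.
conn = psycopg2.connect(
    "host=db-primary dbname=app user=app",       # placeholder DSN
    options="-c statement_timeout=10000",        # 10,000 ms, applied to this session
)

# The same cap can also be set on an already-open session:
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '10s'")
```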
### Open Questions

1. **Postgres managed service or self-hosted?** If managed (e.g., AWS RDS/Aurora), what is the current instance class and max scaling tier? This affects the ceiling for vertical upgrades and available extensions (Citus).
2. **How many distinct permission models exist across teams?** Need an audit of current RBAC/ACL implementations to understand the scope of consolidation into the permissions service.
3. **Is there an existing data warehouse or analytics pipeline?** If yes, the export service and analytics event contract can leverage it; if no, we need to factor pipeline setup into the roadmap.
4. **What is the current deployment model (K8s, ECS, bare metal)?** This affects how platform services are deployed and how auto-scaling is configured.
5. **Are there existing enterprise customer SLA commitments?** If contractual SLAs already exist, they constrain our SLO targets (SLOs must be stricter than SLAs).
6. **What is the budget for infrastructure scaling?** Vertical Postgres upgrades and Citus licensing have different cost profiles; need budget parameters to recommend the right option.
7. **Is there an existing on-call rotation, or is this being created from scratch?** Affects M0 timeline and platform team formation.

### Next Steps

1. **This week:** VP Engineering approves platform team formation (M0). Identify 2-3 engineers to reassign immediately. Post job requisitions for remaining slots.
2. **This week:** DBA/platform lead begins M1 -- deploy PgBouncer in staging; configure read replica routing. Target production deployment in 2 weeks.
3. **This week:** Run a 1-hour audit of existing permission implementations across all teams. Document the current state to scope M3.
4. **Week 2:** Finalize observability tooling choices (Datadog vs. Grafana stack) and begin M2 implementation.
5. **Week 2:** Publish this Platform & Infrastructure Improvement Pack to all engineering teams. Schedule 30-minute walkthrough for stakeholders.
6. **Week 3:** Hold first weekly capacity review meeting (30 min). Review doomsday clock metrics and track progress on M1.
7. **Week 4:** Evaluate M1 results. If insufficient, fast-track S4 evaluation (vertical upgrade vs. Citus vs. analytical store).
8. **Ongoing:** Bi-weekly platform team retrospective to assess migration progress and surface blockers.

---

## Quality Gate Self-Assessment

### Checklist Verification

**A) Scope + contracts**

- [x] "When to use / When NOT to use" is explicit; redirects to `platform-strategy`, `technical-roadmaps`, `managing-tech-debt`, and `engineering-culture`.
- [x] Inputs are sufficient; missing info handled via 5 explicit assumptions (A1-A5).
- [x] Deliverables are explicit and ordered (sections 1-8).

**B) Platformization quality**

- [x] All 3 shared capability candidates have 2+ consumers (Export: 5, Filtering: 6, Permissions: 7).
- [x] Each has a proposed contract (REST API, internal SDK, gRPC service) and ownership model.
- [x] Migration/rollout plan exists per capability (phased with shims, shadow mode, deprecation windows).

**C) Infrastructure quality attributes**

- [x] Reliability and performance targets are measurable (SLOs/SLIs table with specific numbers).
- [x] Privacy/safety requirements spelled out (encryption, residency, retention, audit).
- [x] Operability covered (dashboards, alerts, runbooks, on-call).
- [x] Cost guardrails included (budgets, alerts, optimization targets).

**D) Scaling readiness ("doomsday clock")**

- [x] 8 limits enumerated with current values (or explicit estimates).
- [x] Trigger thresholds account for lead time (e.g., disk trigger at 650 GB with 6-8 week lead time).
- [x] Each trigger has an owner and named mitigation project.
- [x] Clear yellow/red policy for reprioritization/feature freeze.

**E) Instrumentation + analytics**

- [x] 6 observability gaps identified with owners and priorities.
- [x] 10 canonical events captured server-side.
- [x] Identity strategy defined (user_id, account_id, anonymous_id) with merge rules.
- [x] Data quality checks defined (schema validation, volume anomalies, null rates, dedupe rates).

**F) Discoverability**

- [x] Explicitly marked "Not applicable" with rationale.

**G) Execution readiness**

- [x] 10 milestones with acceptance criteria and owners.
- [x] Dependencies and rollout/rollback plans included for every milestone.
- [x] Risks (6), open questions (7), and next steps (8) are present and actionable.