# Engineering Culture Improvement Plan

## Context Summary

- **Company**: B2B analytics SaaS
- **Team**: 40 engineers across 5 teams
- **Architecture**: Rails monolith + 3 Go microservices
- **Current deploy cadence**: Twice/week via manual release trains
- **Incident rate**: ~2 P1 incidents/month
- **Key symptoms**: Slow PR reviews (3+ days), no on-call ownership, platform team bottleneck, poor mid-sprint communication
- **Goal**: Daily deploys + 50% P1 reduction within one quarter

---

## 1. Deployment Pipeline & Release Process

### Current Problem

Manual release trains twice a week create large batch sizes, increase risk per deploy, and slow feedback loops.

### Recommendations

**Move to continuous delivery with feature flags**

- Adopt a trunk-based development model. Engineers merge small PRs to `main` daily.
- Implement a feature flag system (e.g., LaunchDarkly, Flipper for Rails, or a lightweight internal solution). Every new feature ships behind a flag so that deploys are decoupled from releases.
- Replace the manual release train with an automated CI/CD pipeline that deploys to production on every green merge to `main`.

**Invest in deployment confidence**

- Require a passing CI suite (unit, integration, and a lightweight smoke test against staging) before any merge.
- Add automated canary or rolling deploys for the Rails monolith. Route 5% of traffic to the new version, monitor error rates and latency for 10 minutes, then promote or roll back automatically.
- For the Go microservices, implement blue-green or canary deploys via your container orchestrator (Kubernetes or similar).

**Reduce batch size**

- Set a soft guideline: PRs should be under 300 lines of diff. Anything larger needs a justification or should be broken into a stack.
- Encourage short-lived branches (< 1 day).

### Timeline

- **Weeks 1-2**: Set up feature flag infrastructure and automated CI gating.
- **Weeks 3-4**: Implement canary deploys on one service as a pilot.
- **Weeks 5-8**: Roll out automated deploys across all services. Sunset the manual release train.
- **Weeks 9-12**: Refine, monitor, and achieve daily (or more frequent) deploys.

---

## 2. PR Review Process & Code Velocity

### Current Problem

PRs sit in review for 3+ days. This kills velocity, increases merge conflicts, and demoralizes engineers.

### Recommendations

**Establish review SLAs**

- First meaningful review within 4 business hours. Final approval within 24 hours.
- Track these metrics and make them visible on a team dashboard (use a tool like LinearB, Sleuth, or a simple script against your Git provider's API -- a minimal sketch appears at the end of this section).

**Assign clear reviewers**

- Use CODEOWNERS files to auto-assign reviewers based on file paths. This eliminates the "who should review this?" ambiguity.
- Each PR should have exactly one required reviewer (not two or three). Trust your engineers.

**Reduce review burden**

- Invest in automated linting, formatting (RuboCop for Rails, `gofmt`/`golangci-lint` for Go), and static analysis. Robots should catch style issues, not humans.
- Write clear PR descriptions with a template: What changed? Why? How to test? Any risks?
- Encourage authors to self-review before requesting review and to annotate non-obvious sections.

**Cultural shifts**

- Normalize small PRs. A PR that takes 15 minutes to review gets reviewed fast. A PR that takes 2 hours sits.
- Introduce "review blocks" -- 30-minute windows at 10am and 2pm where engineers prioritize clearing their review queue.
- Recognize and praise fast, high-quality reviewers publicly.
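For teams that prefer a homegrown dashboard over a paid tool, time-to-first-review is straightforward to pull from the Git provider. Below is a minimal Go sketch against the GitHub REST API; the `acme/analytics` repo slug is a placeholder, it assumes a `GITHUB_TOKEN` environment variable, and it deliberately ignores pagination, rate limits, and weekends, so treat it as a starting point rather than a finished metric.

```go
// prstats: rough time-to-first-review report from the GitHub REST API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type pr struct {
	Number    int       `json:"number"`
	CreatedAt time.Time `json:"created_at"`
}

type review struct {
	SubmittedAt time.Time `json:"submitted_at"`
}

// get fetches a GitHub API URL and decodes the JSON response into out.
func get(url string, out any) error {
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	repo := "acme/analytics" // hypothetical repo slug -- substitute your own

	var prs []pr
	if err := get(fmt.Sprintf("https://api.github.com/repos/%s/pulls?state=closed&per_page=30", repo), &prs); err != nil {
		panic(err)
	}
	for _, p := range prs {
		var reviews []review
		if err := get(fmt.Sprintf("https://api.github.com/repos/%s/pulls/%d/reviews", repo, p.Number), &reviews); err != nil || len(reviews) == 0 {
			continue // no recorded reviews; skip this PR
		}
		// Reviews come back in chronological order, so the first entry
		// approximates time-to-first-review.
		fmt.Printf("PR #%d: first review after %s\n",
			p.Number, reviews[0].SubmittedAt.Sub(p.CreatedAt).Round(time.Minute))
	}
}
```

Aggregating the printed durations into a weekly median per team is enough for the dashboard; resist the urge to slice by individual, per the "team-level tracking" mitigation in the risks section.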
### Timeline

- **Week 1**: Implement CODEOWNERS, PR templates, and automated linting.
- **Week 2**: Announce review SLAs. Start tracking metrics.
- **Weeks 3-12**: Monitor, coach, and iterate. Publicly celebrate improvements.

---

## 3. On-Call & Incident Management

### Current Problem

Random on-call rotation means no ownership, no expertise building, and slow incident response. Two P1s/month is too many.

### Recommendations

**Structured on-call rotation by team**

- Each of the 5 teams owns the services and code they build. On-call rotates within the team on a weekly basis.
- Every team has a primary and secondary on-call. Primary responds; secondary is backup.
- Use PagerDuty or Opsgenie to manage rotations, escalation policies, and alerting.

**On-call onboarding & runbooks**

- Each team maintains runbooks for their services covering: common alerts, diagnostic steps, rollback procedures, escalation criteria.
- New on-call engineers shadow for one rotation before going primary.
- Provide a "first 5 minutes" checklist for every alert: check dashboards, check recent deploys, check dependent services.

**Incident response process**

- Define severity levels clearly (P1 = customer-facing outage or data integrity issue, P2 = degraded service, etc.).
- For P1s: Incident commander (rotating role) coordinates. Communication goes to a dedicated Slack channel. Status updates every 15 minutes.
- Mandatory blameless post-mortems for every P1 within 48 hours. Post-mortems must identify contributing causes and produce action items with owners and deadlines.

**Reduce P1 frequency**

- Analyze the last 6 months of P1s. Categorize them (deploy-related, infrastructure, data pipeline, third-party dependency, etc.). Attack the top category.
- Improve observability: structured logging, distributed tracing (e.g., Datadog, Honeycomb), and SLO-based alerting rather than threshold-based alerting.
- Require pre-deploy checklists for high-risk changes (database migrations, schema changes, infrastructure modifications).

### Timeline

- **Weeks 1-2**: Define severity levels, set up PagerDuty with team-based rotations, create escalation policies.
- **Weeks 3-4**: Teams draft initial runbooks. Conduct a P1 retrospective analysis.
- **Weeks 5-8**: Implement observability improvements targeting the top P1 category.
- **Weeks 9-12**: Refine alerting, complete runbook coverage, measure P1 trend.

---

## 4. Platform Team & Internal Developer Experience

### Current Problem

The platform team is a bottleneck for every feature team. This creates dependencies, waiting, and frustration.

### Recommendations

**Shift platform from gatekeeper to enabler**

- Platform team should build self-service tools, not do work on behalf of feature teams. The goal is to make feature teams autonomous.
- Identify the top 5 reasons feature teams file tickets to platform. Build self-service solutions for at least 3 of them within the quarter.

**Specific self-service targets (likely candidates)**

- **Infrastructure provisioning**: Provide Terraform modules or internal CLI tools (see the sketch after this list) so feature teams can spin up their own staging environments, add new queues, or create database read replicas without a platform ticket.
- **CI/CD pipeline configuration**: Let feature teams own their pipeline configs (e.g., `.github/workflows/` or equivalent) with platform-provided templates.
- **Observability setup**: Provide dashboards-as-code templates so teams can instrument and monitor their own services.
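To make the first bullet concrete, here is a deliberately thin Go sketch of an internal CLI that wraps a platform-owned Terraform module. Everything named here is a hypothetical example -- the `envctl` tool, the `infrastructure/staging-env` module path, and the `team`/`env_name` variables -- and it assumes Terraform 1.4+ for the `-or-create` workspace flag. The value is in the guardrails (one workspace per team/environment, plan before apply), not the wrapper itself.

```go
// envctl: hypothetical self-service CLI for spinning up a staging environment.
// It shells out to terraform against a platform-owned module, so feature teams
// get a one-command workflow while platform keeps control of module internals.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes terraform with the given args inside the module directory,
// streaming output so the engineer sees exactly what Terraform is doing.
func run(dir string, args ...string) error {
	cmd := exec.Command("terraform", args...)
	cmd.Dir = dir
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: envctl <team> <env-name>")
		os.Exit(1)
	}
	team, env := os.Args[1], os.Args[2]
	dir := "infrastructure/staging-env" // hypothetical platform-owned module
	workspace := fmt.Sprintf("%s-%s", team, env)

	// One workspace per team/environment pair keeps Terraform state isolated.
	if err := run(dir, "workspace", "select", "-or-create", workspace); err != nil {
		os.Exit(1)
	}
	// Plan first so the engineer sees what will change before applying.
	vars := []string{"-var", "team=" + team, "-var", "env_name=" + env}
	if err := run(dir, append([]string{"plan"}, vars...)...); err != nil {
		os.Exit(1)
	}
	if err := run(dir, append([]string{"apply", "-auto-approve"}, vars...)...); err != nil {
		os.Exit(1)
	}
	fmt.Printf("staging environment %q is ready\n", workspace)
}
```

A shell script would work just as well on day one; a compiled Go binary is only worth it once the tool grows validation, tagging conventions, or cost guardrails.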
**Embed, don't centralize**

- Consider rotating a platform engineer into each feature team for 2-week stints to transfer knowledge and identify friction points.
- Establish "platform office hours" (2 hours/week) for ad-hoc questions instead of requiring tickets for everything.

**Define a platform product roadmap**

- Treat the platform as an internal product. The platform team's "customers" are the feature teams.
- Run a quarterly survey or retrospective asking feature teams: What slows you down? What do you need from platform?
- Prioritize platform work based on developer-hours-saved, not technical elegance.

### Timeline

- **Weeks 1-2**: Survey feature teams to identify top friction points. Audit current ticket backlog.
- **Weeks 3-6**: Build and ship first 2 self-service solutions.
- **Weeks 7-10**: Ship 1-2 more. Begin platform embedding rotation.
- **Weeks 11-12**: Re-survey. Measure reduction in platform tickets.

---

## 5. Communication & Sprint Transparency

### Current Problem

PMs complain that engineering "goes dark" mid-sprint. This erodes trust and leads to misaligned priorities.

### Recommendations

**Async status updates**

- Engineers post a brief daily standup update in a shared Slack channel or tool (Geekbot, Standuply, or just a pinned thread): What I did, what I'm doing, any blockers.
- This replaces or supplements synchronous standups, which often become rote.

**Mid-sprint check-ins**

- At the midpoint of each sprint (e.g., Wednesday of a 2-week sprint), hold a 30-minute sync between the tech lead and PM for each team. Review: Are we on track? Any scope changes needed? Any surprises?
- If something is at risk, surface it here -- not at sprint review.

**Work-in-progress visibility**

- Use your project management tool (Jira, Linear, etc.) rigorously. Every piece of work should have a ticket. Ticket status should reflect reality.
- Automate status transitions where possible (e.g., PR opened = "In Review", PR merged = "Done").

**Demo culture**

- End every sprint with a 30-minute demo. Engineers show working software, not slides. PMs, designers, and stakeholders attend.
- This builds shared understanding and celebrates progress.

**Escalation norms**

- Define a clear norm: If a task is blocked for more than half a day, escalate. No one should silently spin.
- Create a lightweight escalation path: engineer -> tech lead -> engineering manager. Response expected within 2 hours during business hours.

### Timeline

- **Week 1**: Set up async standup tooling. Announce new norms.
- **Week 2**: Begin mid-sprint check-ins.
- **Weeks 3-4**: Automate ticket status transitions. Introduce demo culture.
- **Weeks 5-12**: Iterate based on PM and engineer feedback.

---

## 6. Metrics & Accountability

### What to Measure

Track these weekly and review them in a monthly engineering leadership meeting:

| Metric | Current Baseline | 90-Day Target |
|---|---|---|
| Deploy frequency | 2x/week | 1x/day (minimum) |
| PR review time (first review) | 3+ days | < 4 hours |
| PR merge time (open to merge) | 4-5 days (est.) | < 24 hours |
| P1 incidents/month | 2 | 1 or fewer |
| Mean time to recovery (MTTR) | Unknown (measure) | < 1 hour |
| Platform team tickets from feature teams | Unknown (measure) | 50% reduction |
| Sprint goal completion rate | Unknown (measure) | > 80% |

### How to Measure

- **Deploy frequency**: Count production deploys per day from CI/CD logs (a minimal sketch follows this list).
- **PR metrics**: Use GitHub/GitLab API or a tool like LinearB.
- **Incident metrics**: Track in PagerDuty or your incident management tool.
- **Platform tickets**: Track in your ticketing system with a "platform" label.
- **Sprint completion**: Track in your project management tool.
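If your CI/CD pipeline can append one line per production deploy (a timestamp is enough), deploy frequency becomes a ten-line report. The sketch below is a minimal Go version under assumed conventions: a `deploys.log` file containing one RFC 3339 timestamp per line, written by a final "record deploy" step in the pipeline. Adapt the parsing to whatever your pipeline actually emits.

```go
// deployfreq: counts production deploys per day from a line-per-deploy log.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"time"
)

func main() {
	f, err := os.Open("deploys.log") // hypothetical log written by CI/CD
	if err != nil {
		panic(err)
	}
	defer f.Close()

	perDay := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Expect one RFC 3339 timestamp per line, e.g. 2024-03-05T14:02:11Z.
		t, err := time.Parse(time.RFC3339, scanner.Text())
		if err != nil {
			continue // skip malformed lines rather than aborting the report
		}
		perDay[t.Format("2006-01-02")]++
	}

	// Print days in order so the trend toward daily deploys is visible.
	days := make([]string, 0, len(perDay))
	for d := range perDay {
		days = append(days, d)
	}
	sort.Strings(days)
	for _, d := range days {
		fmt.Printf("%s  %d deploys\n", d, perDay[d])
	}
}
```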
### Accountability

- Each team's tech lead owns their team's metrics and reports weekly.
- The engineering manager reviews cross-team trends monthly.
- Share a monthly "engineering health" summary with the broader org (PMs, leadership) to build trust.

---

## 7. Implementation Roadmap (12-Week Overview)

### Phase 1: Foundation (Weeks 1-4)

- Set up feature flag infrastructure
- Implement CODEOWNERS and PR review SLAs
- Define incident severity levels and set up PagerDuty with team-based rotations
- Survey feature teams on platform friction points
- Launch async standups and mid-sprint check-ins
- Begin measuring all baseline metrics

### Phase 2: Build (Weeks 5-8)

- Roll out automated deploys (canary/rolling) across services
- Ship first self-service platform tools
- Implement observability improvements targeting top P1 category
- Teams draft runbooks; conduct P1 retrospective
- Automate ticket status transitions
- Review and adjust PR review SLAs based on data

### Phase 3: Scale & Refine (Weeks 9-12)

- Achieve daily deploys; sunset manual release trains
- Ship remaining self-service platform tools
- Refine alerting and on-call processes
- Re-survey feature teams on platform experience
- Conduct quarter retrospective; measure against targets
- Document what worked and plan next quarter

---

## 8. Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Engineers resist daily deploys due to fear of breaking production | Start with low-risk services. Invest heavily in canary deploys and automated rollback. Celebrate successful daily deploys. |
| PR review SLAs feel like surveillance | Frame as a team norm, not a management mandate. Track at team level, not individual level initially. |
| On-call burden feels unfair across teams (some services are noisier) | Invest in reducing alert noise first. Compensate on-call with time off or stipend. |
| Platform self-service tools take too long to build | Start with the simplest wins (documentation, templates, scripts) before building full self-service platforms. |
| PMs over-index on new communication rituals | Keep ceremonies lightweight. If a mid-sprint check-in has nothing to discuss, cancel it. Avoid process theater. |
| Quarter is too short to achieve all goals | Prioritize deploy frequency and P1 reduction as the primary goals. Communication and platform improvements are supporting goals that may extend into Q2. |

---

## 9. Cultural Principles to Reinforce

Throughout this transformation, consistently reinforce these principles:

1. **Ownership over assignment.** Teams own their services end-to-end: building, shipping, running, and fixing them.
2. **Small batches over big bangs.** Small PRs, small deploys, small experiments. Reduce the blast radius of everything.
3. **Transparency over opacity.** Share status, share metrics, share post-mortems. Default to open.
4. **Automation over manual gates.** If a human is doing something a machine could do, automate it.
5. **Speed and safety are not tradeoffs.** Deploying more frequently with smaller changes is both faster and safer than deploying less frequently with larger changes.

---

## 10. Quick Wins (First 2 Weeks)

To build momentum, prioritize these high-impact, low-effort changes immediately:

1. **Set up CODEOWNERS** -- 1 hour of work, immediate impact on review assignment.
2. **Add automated linting to CI** -- eliminates an entire class of review comments.
3. **Announce PR review SLA** (4-hour first review) -- costs nothing, sets expectations.
4. **Assign on-call by team** in PagerDuty -- can be done in a day.
5. **Start async standups** -- pick a Slack bot and turn it on.
6. **Schedule the first mid-sprint PM check-in** -- put it on the calendar now.
7. **Pull the last 6 months of P1 data** and categorize root causes -- the analysis alone will reveal your highest-leverage improvement. (A starter script for the tally is sketched below.)

These seven actions can all be completed in the first two weeks and will create visible, immediate improvements in velocity, reliability, and communication.
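To make quick win #7 concrete: once the incidents are exported to CSV (most incident tools, PagerDuty included, can export incident lists), the tally itself is trivial. The Go sketch below assumes a hand-labeled `category` column in a file named `p1s.csv` -- both are illustrative conventions, and the hand-labeling is the valuable, human part of the exercise.

```go
// p1tally: counts P1 incidents per hand-assigned category from a CSV export.
// Assumes p1s.csv has a header row containing a "category" column, e.g.:
//   date,summary,category
//   2024-01-12,checkout 500s,deploy-related
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sort"
)

func main() {
	f, err := os.Open("p1s.csv") // hypothetical export from your incident tool
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		panic(err)
	}

	// Locate the "category" column from the header row.
	col := -1
	for i, h := range rows[0] {
		if h == "category" {
			col = i
		}
	}
	if col < 0 {
		panic("no category column found")
	}

	// Tally incidents per category.
	counts := map[string]int{}
	for _, row := range rows[1:] {
		counts[row[col]]++
	}

	// Print categories most-frequent first: attack the top one.
	cats := make([]string, 0, len(counts))
	for c := range counts {
		cats = append(cats, c)
	}
	sort.Slice(cats, func(i, j int) bool { return counts[cats[i]] > counts[cats[j]] })
	for _, c := range cats {
		fmt.Printf("%-25s %d\n", c, counts[c])
	}
}
```

The top line of the output is the observability investment target for Phase 2 of the roadmap.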