3 AM pages aren’t heroic — they’re a system failure.
I’ve spent countless nights staring at alerts, wondering whether I should wake up the senior engineer. After a decade of both holding the pager and building on-call systems, at companies ranging from tiny startups to large, established ones, I’ve learned that the biggest on-call problems aren’t technical; they’re cultural. Here’s my guide to building a sustainable on-call culture.
The Reality of Modern On-Call
Let’s start with a truth that might ruffle some feathers: If your engineers are regularly getting woken up at night, your system is broken. Not just technically, but organizationally.
I learned this lesson the hard way at my first startup. We were a team of four engineers, running a service used by millions. Our on-call rotation was brutal — two people alternating weeks, getting paged 3–4 times per night. We wore those interruptions like badges of honor. “This is just how things are in high-scale systems,” we told ourselves.
Then I joined my next gig, a much larger company. My first week, I noticed something strange: despite handling traffic several orders of magnitude larger than my previous company, the on-call engineers were sleeping through most nights. That’s when I realized we’d been doing it wrong all along.
The Hero Trap: Why Engineers Break Themselves
Most engineers’ first on-call rotation goes something like this: Your pager goes off at 2 AM. Your heart races. You dive in, determined to fix it yourself. Three hours later, you’re still debugging, barely coherent, when a teammate logs in and solves it in 10 minutes.
This “hero” mentality breaks both engineers and systems. Here’s why:
The Cost of Hero Culture
- Sleep deprivation impairs judgment much like alcohol does
- Heroic fixes often address symptoms, not root causes
- Teams become dependent on tribal knowledge
- Documentation remains poor because “heroes” just handle it
- Engineers burn out and leave
At my last company, we lost our best infrastructure engineer because of hero culture. He was the only one who could fix certain problems, so he got called for everything. After six months of interrupted sleep, he quit. The irony? Once he left, we were forced to address our systemic issues properly.
The New On-Call Mindset: Triage Over Heroics
Your job isn’t to fix everything yourself — it’s to ensure issues get addressed properly. Let’s break down what this means in practice.
The Triage Framework
When that alert hits, ask these questions in order:
1. System Impact
   - Which services are affected?
   - Is this customer-facing?
   - Are we losing data?
   - Is this affecting revenue?
2. Scale of Impact
   - What percentage of users are impacted?
   - Which regions are impacted?
   - Is this affecting specific customer segments?
3. Urgency Assessment
   - Does this need immediate attention?
   - What’s the cost of waiting until morning?
   - Are there temporary workarounds?
4. Skills/Scope Evaluation
   - Do I have the expertise to fix this?
   - Do I have access to all required systems?
   - Who else needs to be involved?
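To make the framework concrete, here’s a rough sketch of it in code. The field names, the 10% threshold, and the wording of the recommendations are illustrative assumptions on my part, not a standard; tune them to your own alerting setup.

```python
from dataclasses import dataclass

@dataclass
class TriageAssessment:
    customer_facing: bool          # System impact: is this visible to customers?
    data_loss: bool                # System impact: are we losing or corrupting data?
    revenue_impacting: bool        # System impact: does this touch money?
    percent_users_affected: float  # Scale: rough share of users impacted (0-100)
    workaround_exists: bool        # Urgency: can we buy time until morning?
    within_my_expertise: bool      # Skills/scope: can I realistically fix this alone?

def triage_decision(a: TriageAssessment) -> str:
    """Walk the framework in order and return a coarse recommendation."""
    # 1. System impact: data loss, or revenue impact with no way to buy time,
    #    always warrants waking up the right people.
    if a.data_loss or (a.revenue_impacting and not a.workaround_exists):
        return "act now and escalate to the domain expert"
    # 2. Scale: broad customer-facing impact is urgent even if revenue is safe for now.
    if a.customer_facing and a.percent_users_affected >= 10:
        return "act now" if a.within_my_expertise else "escalate to the domain expert"
    # 3. Urgency: contained issues with a workaround can usually wait for daylight.
    if a.workaround_exists:
        return "apply the workaround, document it, and defer the fix to morning"
    # 4. Skills/scope: don't grind alone on something outside your expertise.
    return "keep investigating" if a.within_my_expertise else "escalate early"
```

Run the 3 AM payment example from the next section through this (customer-facing, roughly 15% of transactions affected, outside my own expertise) and it lands on escalating to the domain expert, which is exactly what I ended up doing.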
Real-World Triage Example
A couple of jobs back, a 3 AM alert came in: “High latency in payment processing service.” Old me would have immediately started debugging. New me went through the framework:
- System Impact: Payment processing affected, definitely customer-facing and revenue-impacting
- Scale: 15% of transactions showing high latency, but all eventually succeeding
- Urgency: Medium — transactions are slow but completing
- Skills/Scope: Payment system involves multiple teams
Decision: I woke up our payments team lead. Why? Because even though transactions were completing, the revenue impact meant we needed expert eyes on it. They found a database index issue that was gradually degrading performance. Fixed in 20 minutes once the right person was involved.
The Science of Escalation
Here’s a counterintuitive truth: The best on-call engineers escalate early and often. Let’s break down when and how to escalate effectively.
When to Escalate
- You’ve spent >30 minutes without a clear direction
- The problem affects core business functions
- You need access you don’t have
- The issue involves multiple systems
- You’re not 100% confident in your fix
- You’re tired and might make mistakes
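These criteria are simple enough to encode as a literal checklist, which helps when you’re too tired to reason clearly. Any single criterion being true is enough; the 30-minute cutoff and the parameter names below are placeholders for whatever your team agrees on.

```python
def should_escalate(minutes_without_direction: int,
                    core_business_affected: bool,
                    missing_access: bool,
                    spans_multiple_systems: bool,
                    confident_in_fix: bool,
                    too_tired_to_be_safe: bool) -> bool:
    """Escalate if any single criterion from the list above is met."""
    return any([
        minutes_without_direction > 30,
        core_business_affected,
        missing_access,
        spans_multiple_systems,
        not confident_in_fix,
        too_tired_to_be_safe,
    ])
```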
How to Escalate Effectively
- Prepare Your Information:
  - Timeline of the incident
  - Current system status
  - Actions already taken
  - Relevant logs and metrics
  - Business impact assessment
- Clear Communication:
  “Hey Alice, we have a P1 incident affecting our payment processor. 15% of transactions are seeing 10x normal latency. I’ve verified it’s not the load balancer or recent deployments. Given your expertise with the payment system, I’d value your insight on this.”
- Follow-up Documentation:
  - Summary of the discussion
  - Action items agreed upon
  - Next steps and owners
  - Timeline for resolution
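One way to keep this consistent is to treat the handoff as a small structured record and generate the page from it. This is only a sketch: the field names are my own shorthand, and the message format simply mirrors the example above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EscalationContext:
    severity: str                     # e.g. "P1"
    summary: str                      # one-line description of the incident
    business_impact: str              # who or what is affected, and how badly
    timeline: List[str] = field(default_factory=list)       # timestamped events so far
    actions_taken: List[str] = field(default_factory=list)  # what you've already ruled out
    evidence: List[str] = field(default_factory=list)       # links to dashboards, logs, metrics

def escalation_page(ctx: EscalationContext, expert: str, ask: str) -> str:
    """Render a page in roughly the shape of the example message above."""
    return (
        f"Hey {expert}, we have a {ctx.severity} incident: {ctx.summary}. "
        f"Impact: {ctx.business_impact}. "
        f"Already checked: {'; '.join(ctx.actions_taken) or 'initial triage only'}. "
        f"{ask}"
    )
```

The same record doubles as the skeleton for the follow-up documentation: the discussion summary, action items, owners, and resolution timeline all attach naturally to it.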
Documentation: Your 3 AM Best Friend
Every incident needs three types of documentation:
1. Live Incident Notes
2024-02-10 03:15 UTC - Alert triggered: High latency in payments
2024-02-10 03:17 UTC - Verified issue in metrics: p95 latency 2.3s (normal 200ms)
2024-02-10 03:20 UTC - Checked recent deployments - none in last 24h
2024-02-10 03:25 UTC - Investigated load balancer logs - normal traffic patterns
2024-02-10 03:30 UTC - Escalated to Alice (Payments Lead)
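Keeping that timestamp discipline at 3 AM is easier when appending a note takes one command. Here’s a minimal helper, assuming you keep live notes in a plain local file rather than a shared doc:

```python
from datetime import datetime, timezone

def log_note(note: str, path: str = "incident-notes.txt") -> None:
    """Append a UTC-timestamped line in the same format as the notes above."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(path, "a") as f:
        f.write(f"{stamp} - {note}\n")

# log_note("Escalated to Alice (Payments Lead)")
```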
2. Investigation Notes
- System components checked
- Commands run and their output
- Hypotheses tested
- Dead ends encountered
- Resource links consulted
3. Resolution Documentation
- Root cause analysis
- Fix implemented
- Verification steps
- Prevention measures
- Follow-up tasks
Building Systems That Let You Sleep
Every alert should trigger two questions:
- How do we fix it right now?
- How do we prevent it from happening again?
System Reliability Checklist
- Monitoring and Alerting
  - Clear thresholds based on business impact
  - Elimination of alert noise
  - Automated resolution for common issues
  - Runbooks for all alerts
- System Design
  - Circuit breakers for external dependencies
  - Graceful degradation paths
  - Automated failover where possible
  - Rate limiting and backpressure
  - Transactions and audit trails for critical operations
- Operational Tools
  - One-click rollbacks
  - Automated diagnostic tools
  - Clear system dashboards
  - Centralized logging
- Process Improvements
  - Regular disaster recovery testing
  - System architecture reviews
  - Capacity planning
  - Performance testing
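Of everything on that list, circuit breakers are the item I see skipped most often and the one that buys the most sleep. Here is a deliberately minimal sketch; a production version would also need per-dependency state, metrics, and thread safety.

```python
import time

class CircuitBreaker:
    """Fail fast on a flaky external dependency instead of letting slow,
    doomed calls pile up. The thresholds are illustrative defaults."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls flow normally)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: go half-open and let one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Callers catch the “circuit open” error and take a degraded path instead, which is exactly the graceful-degradation item above.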
The Human Side of On-Call
On-call isn’t just about technical skills. It’s also about:
Clear Communication
- Regular status updates to stakeholders
- Clear escalation paths
- Documented decision-making processes
- Post-incident reviews
Team Support
- Mental health resources
- Rotation flexibility
- Training and shadowing programs
- Recognition and compensation
Stakeholder Management
- Clear incident severity definitions
- Regular communication templates
- Expectations setting
- Business impact assessment
Building a World-Class On-Call Culture
Clear Escalation Paths
Every engineer needs to know exactly whom to contact and when. Here’s a sample escalation matrix:
Severity Levels:
- P0: Critical business impact (Payment system down)
- P1: Significant impact (Major feature unavailable)
- P2: Moderate impact (Performance degradation)
- P3: Minor impact (Non-critical system issues)
Escalation Flow:
- L1: Primary on-call engineer
- L2: Service owner/domain expert
- L3: Engineering manager
- L4: VP of Engineering
Response Times:
- P0: Immediate response required
- P1: 15-minute response time
- P2: 1-hour response time
- P3: Next business day
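A matrix like this earns its keep when it lives as data that your paging tooling can read, not only as a wiki page. A sketch follows, with timings and roles copied from the sample above; how far each severity climbs the chain is my own assumption.

```python
# Severity -> expected response time and escalation chain.
# Chain depth per severity is an assumption; adjust to your org.
ESCALATION_POLICY = {
    "P0": {"response": "immediate",         "chain": ["primary on-call", "service owner", "engineering manager", "VP of engineering"]},
    "P1": {"response": "15 minutes",        "chain": ["primary on-call", "service owner", "engineering manager"]},
    "P2": {"response": "1 hour",            "chain": ["primary on-call", "service owner"]},
    "P3": {"response": "next business day", "chain": ["primary on-call"]},
}

def next_responder(severity: str, level: int) -> str:
    """Who to page at escalation level `level` (0 = the first responder)."""
    chain = ESCALATION_POLICY[severity]["chain"]
    return chain[min(level, len(chain) - 1)]
```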
Comprehensive Runbooks
Every alert should have:
Context Section:
- Alert meaning and impact
- Normal vs. abnormal values
- Recent changes that might affect it
- Business impact assessment
Diagnostic Steps:
- Initial verification steps
- Common causes and solutions
- Data collection requirements
- Escalation criteria
Resolution Steps:
- Step-by-step fix procedures
- Verification methods
- Rollback procedures
- Post-resolution tasks
Fair Rotation Management
A healthy rotation includes:
Schedule Structure:
- Maximum 7 days on primary
- Minimum 14 days between rotations
- At least 5 engineers per rotation
- Clear backup scheduling
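Those numbers hang together arithmetically: in a simple round-robin rotation, the gap between one engineer’s primary shifts is (team size minus one) times the shift length, so five engineers on 7-day shifts means 28 days off primary, and three engineers is the bare minimum that clears the 14-day floor. A tiny check, under that round-robin assumption:

```python
def days_between_primary_shifts(team_size: int, shift_days: int = 7) -> int:
    """Gap between one engineer's primary shifts in a simple round-robin rotation."""
    return (team_size - 1) * shift_days

assert days_between_primary_shifts(5) == 28  # the suggested team size gives a comfortable gap
assert days_between_primary_shifts(3) == 14  # the smallest team that still clears the floor
```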
Compensation:
- Base on-call stipend
- Per-incident compensation
- Time off after major incidents
- Tool and training budget
Support Systems:
- Mental health resources
- Flexible scheduling
- Training programs
- Regular rotation reviews
Measuring Success
How do you know if your on-call culture is healthy? Here are our key metrics:
Technical Metrics:
- Mean Time Between Failures (MTBF)
- Mean Time To Resolution (MTTR)
- Number of night pages
- False positive rate
- Repeat incident rate
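MTTR and MTBF are straightforward to compute once every incident is recorded with a start and a resolution time. The sketch below assumes you can export those as timestamp pairs from whichever incident tool you use.

```python
from datetime import datetime
from statistics import mean
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (started, resolved)

def mttr_hours(incidents: List[Incident]) -> float:
    """Mean Time To Resolution: average of (resolved - started)."""
    return mean((end - start).total_seconds() / 3600 for start, end in incidents)

def mtbf_hours(incidents: List[Incident]) -> float:
    """Mean Time Between Failures: average gap between consecutive incident starts."""
    starts = sorted(start for start, _ in incidents)
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(starts, starts[1:])]
    return mean(gaps) if gaps else float("inf")
```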
Human Metrics:
- Engineer satisfaction scores
- On-call stress levels
- Training completion rates
- Documentation quality scores
- Post-incident review completion
Business Metrics:
- Customer impact duration
- Revenue impact from incidents
- Time to customer notification
- Resolution SLA compliance
The Path Forward
Remember: The goal isn’t to handle every incident perfectly. It’s to learn from each one while maintaining your sanity and your sleep schedule.
Start with these steps:
- Audit your current state
- Identify the biggest pain points
- Implement quick wins first
- Build long-term improvement plans
- Regularly review and adjust
Your pager will go off again. But with these principles in mind, you’ll handle it with confidence and clarity, and maybe even get back to sleep afterward.
And if you’re currently in a toxic on-call situation? Sometimes the best thing you can do is find a team that values sustainable engineering practices. Your health matters more than any system.
Prioritize sleep over heroics. Your future self will thank you.