3 AM pages aren’t heroic — they’re a system failure.
I’ve spent countless nights staring at alerts, wondering whether I should wake up the senior engineer. After a decade of both holding the pager and building on-call systems, at companies ranging from tiny startups to large, established ones, I’ve learned that the biggest on-call problems aren’t technical; they’re cultural. Here’s my guide to building a sustainable on-call culture.
The Reality of Modern On-Call
Let’s start with a truth that might ruffle some feathers: If your engineers are regularly getting woken up at night, your system is broken. Not just technically, but organizationally.
I learned this lesson the hard way at my first startup. We were a team of four engineers, running a service used by millions. Our on-call rotation was brutal — two people alternating weeks, getting paged 3–4 times per night. We wore those interruptions like badges of honor. “This is just how things are in high-scale systems,” we told ourselves.
Then I joined my next gig, a much larger company. My first week, I noticed something strange: despite handling traffic several orders of magnitude larger than my previous company, the on-call engineers were sleeping through most nights. That’s when I realized we’d been doing it wrong all along.
The Hero Trap: Why Engineers Break Themselves
Most engineers’ first on-call rotation goes something like this: Your pager goes off at 2 AM. Your heart races. You dive in, determined to fix it yourself. Three hours later, you’re still debugging, barely coherent, when a teammate logs in and solves it in 10 minutes.
This “hero” mentality breaks both engineers and systems. Here’s why:
The Cost of Hero Culture
- Sleep deprivation impairs judgment much like alcohol does
- Heroic fixes often address symptoms, not root causes
- Teams become dependent on tribal knowledge
- Documentation remains poor because “heroes” just handle it
- Engineers burn out and leave
At my last company, we lost our best infrastructure engineer because of hero culture. He was the only one who could fix certain problems, so he got called for everything. After six months of interrupted sleep, he quit. The irony? Once he left, we were forced to address our systemic issues properly.
The New On-Call Mindset: Triage Over Heroics
Your job isn’t to fix everything yourself — it’s to ensure issues get addressed properly. Let’s break down what this means in practice.
The Triage Framework
When that alert hits, ask these questions in order:
1. System Impact
   - Which services are affected?
   - Is this customer-facing?
   - Are we losing data?
   - Is this affecting revenue?
2. Scale of Impact
   - What percentage of users are impacted?
   - Which regions are impacted?
   - Is this affecting specific customer segments?
3. Urgency Assessment
   - Does this need immediate attention?
   - What’s the cost of waiting until morning?
   - Are there temporary workarounds?
4. Skills/Scope Evaluation
   - Do I have the expertise to fix this?
   - Do I have access to all required systems?
   - Who else needs to be involved?
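To make the framework concrete, here’s a rough sketch of it in code. The field names, the 10% threshold, and the wording of the recommendations are illustrative assumptions on my part, not a standard; tune them to your own alerting setup.

```python
from dataclasses import dataclass

@dataclass
class TriageAssessment:
    customer_facing: bool          # System impact: is this visible to customers?
    data_loss: bool                # System impact: are we losing or corrupting data?
    revenue_impacting: bool        # System impact: does this touch money?
    percent_users_affected: float  # Scale: rough share of users impacted (0-100)
    workaround_exists: bool        # Urgency: can we buy time until morning?
    within_my_expertise: bool      # Skills/scope: can I realistically fix this alone?

def triage_decision(a: TriageAssessment) -> str:
    """Walk the framework in order and return a coarse recommendation."""
    # 1. System impact: data loss, or revenue impact with no way to buy time,
    #    always warrants waking up the right people.
    if a.data_loss or (a.revenue_impacting and not a.workaround_exists):
        return "act now and escalate to the domain expert"
    # 2. Scale: broad customer-facing impact is urgent even if revenue is safe for now.
    if a.customer_facing and a.percent_users_affected >= 10:
        return "act now" if a.within_my_expertise else "escalate to the domain expert"
    # 3. Urgency: contained issues with a workaround can usually wait for daylight.
    if a.workaround_exists:
        return "apply the workaround, document it, and defer the fix to morning"
    # 4. Skills/scope: don't grind alone on something outside your expertise.
    return "keep investigating" if a.within_my_expertise else "escalate early"
```

Run the 3 AM payment example from the next section through this (customer-facing, roughly 15% of transactions affected, outside my own expertise) and it lands on escalating to the domain expert, which is exactly what I ended up doing.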
Real-World Triage Example
A couple of jobs back, a 3 AM alert came in: “High latency in payment processing service.” Old me would have immediately started debugging. New me went through the framework:
- System Impact: Payment processing affected, definitely customer-facing and revenue-impacting
- Scale: 15% of transactions showing high latency, but all eventually succeeding
- Urgency: Medium — transactions are slow but completing
- Skills/Scope: Payment system involves multiple teams
Decision: I woke up our payments team lead. Why? Because even though transactions were completing, the revenue impact meant we needed expert eyes on it. They found a database index issue that was gradually degrading performance. Fixed in 20 minutes once the right person was involved.
The Science of Escalation
Here’s a counterintuitive truth: The best on-call engineers escalate early and often. Let’s break down when and how to escalate effectively.
When to Escalate
- You’ve spent >30 minutes without a clear direction
- The problem affects core business functions
- You need access you don’t have
- The issue involves multiple systems
- You’re not 100% confident in your fix
- You’re tired and might make mistakes
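These criteria are simple enough to encode as a literal checklist, which helps when you’re too tired to reason clearly. Any single criterion being true is enough; the 30-minute cutoff and the parameter names below are placeholders for whatever your team agrees on.

```python
def should_escalate(minutes_without_direction: int,
                    core_business_affected: bool,
                    missing_access: bool,
                    spans_multiple_systems: bool,
                    confident_in_fix: bool,
                    too_tired_to_be_safe: bool) -> bool:
    """Escalate if any single criterion from the list above is met."""
    return any([
        minutes_without_direction > 30,
        core_business_affected,
        missing_access,
        spans_multiple_systems,
        not confident_in_fix,
        too_tired_to_be_safe,
    ])
```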
How to Escalate Effectively
- Prepare Your Information:
  - Timeline of the incident
  - Current system status
  - Actions already taken
  - Relevant logs and metrics
  - Business impact assessment
- Clear Communication:
  “Hey Alice, we have a P1 incident affecting our payment processor. 15% of transactions are seeing 10x normal latency. I’ve verified it’s not the load balancer or recent deployments. Given your expertise with the payment system, I’d value your insight on this.”
- Follow-up Documentation:
  - Summary of the discussion
  - Action items agreed upon
  - Next steps and owners
  - Timeline for resolution
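One way to keep this consistent is to treat the handoff as a small structured record and generate the page from it. This is only a sketch: the field names are my own shorthand, and the message format simply mirrors the example above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EscalationContext:
    severity: str                     # e.g. "P1"
    summary: str                      # one-line description of the incident
    business_impact: str              # who or what is affected, and how badly
    timeline: List[str] = field(default_factory=list)       # timestamped events so far
    actions_taken: List[str] = field(default_factory=list)  # what you've already ruled out
    evidence: List[str] = field(default_factory=list)       # links to dashboards, logs, metrics

def escalation_page(ctx: EscalationContext, expert: str, ask: str) -> str:
    """Render a page in roughly the shape of the example message above."""
    return (
        f"Hey {expert}, we have a {ctx.severity} incident: {ctx.summary}. "
        f"Impact: {ctx.business_impact}. "
        f"Already checked: {'; '.join(ctx.actions_taken) or 'initial triage only'}. "
        f"{ask}"
    )
```

The same record doubles as the skeleton for the follow-up documentation: the discussion summary, action items, owners, and resolution timeline all attach naturally to it.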
Documentation: Your 3 AM Best Friend
Every incident needs three types of documentation:
1. Live Incident Notes
2024-02-10 03:15 UTC - Alert triggered: High latency in payments
2024-02-10 03:17 UTC - Verified issue in metrics: p95 latency 2.3s (normal 200ms)
2024-02-10 03:20 UTC - Checked recent deployments - none in last 24h
2024-02-10 03:25 UTC - Investigated load balancer logs - normal traffic patterns
2024-02-10 03:30 UTC - Escalated to Alice (Payments Lead)
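Keeping that timestamp discipline at 3 AM is easier when appending a note takes one command. Here’s a minimal helper, assuming you keep live notes in a plain local file rather than a shared doc:

```python
from datetime import datetime, timezone

def log_note(note: str, path: str = "incident-notes.txt") -> None:
    """Append a UTC-timestamped line in the same format as the notes above."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(path, "a") as f:
        f.write(f"{stamp} - {note}\n")

# log_note("Escalated to Alice (Payments Lead)")
```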
2. Investigation Notes
- System components checked
- Commands run and their output
- Hypotheses tested
- Dead ends encountered
- Resource links consulted
3. Resolution Documentation
- Root cause analysis
- Fix implemented
- Verification steps
- Prevention measures
- Follow-up tasks
Building Systems That Let You Sleep
Every alert should trigger two questions:
- How do we fix it right now?
- How do we prevent it from happening again?
System Reliability Checklist
- Monitoring and Alerting
  - Clear thresholds based on business impact
  - Elimination of alert noise
  - Automated resolution for common issues
  - Runbooks for all alerts
- System Design
  - Circuit breakers for external dependencies
  - Graceful degradation paths
  - Automated failover where possible
  - Rate limiting and backpressure
  - Transactions and audit trails for critical operations
- Operational Tools
  - One-click rollbacks
  - Automated diagnostic tools
  - Clear system dashboards
  - Centralized logging
- Process Improvements
  - Regular disaster recovery testing
  - System architecture reviews
  - Capacity planning
  - Performance testing
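Of everything on that list, circuit breakers are the item I see skipped most often and the one that buys the most sleep. Here is a deliberately minimal sketch; a production version would also need per-dependency state, metrics, and thread safety.

```python
import time

class CircuitBreaker:
    """Fail fast on a flaky external dependency instead of letting slow,
    doomed calls pile up. The thresholds are illustrative defaults."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls flow normally)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: go half-open and let one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Callers catch the “circuit open” error and take a degraded path instead, which is exactly the graceful-degradation item above.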
The Human Side of On-Call
On-call isn’t just about technical skills. It’s also about:
Clear Communication
- Regular status updates to stakeholders
- Clear escalation paths
- Documented decision-making processes
- Post-incident reviews
Team Support
- Mental health resources
- Rotation flexibility
- Training and shadowing programs
- Recognition and compensation
Stakeholder Management
- Clear incident severity definitions
- Regular communication templates
- Expectations setting
- Business impact assessment
Building a World-Class On-Call Culture
Clear Escalation Paths
Every engineer needs to know exactly whom to contact and when. Here’s a sample escalation matrix:
Severity Levels:
- P0: Critical business impact (Payment system down)
- P1: Significant impact (Major feature unavailable)
- P2: Moderate impact (Performance degradation)
- P3: Minor impact (Non-critical system issues)
Escalation Flow:
- L1: Primary on-call engineer
- L2: Service owner/domain expert
- L3: Engineering manager
- L4: VP of Engineering
Response Times:
- P0: Immediate response required
- P1: 15-minute response time
- P2: 1-hour response time
- P3: Next business day
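A matrix like this earns its keep when it lives as data that your paging tooling can read, not only as a wiki page. A sketch follows, with timings and roles copied from the sample above; how far each severity climbs the chain is my own assumption.

```python
# Severity -> expected response time and escalation chain.
# Chain depth per severity is an assumption; adjust to your org.
ESCALATION_POLICY = {
    "P0": {"response": "immediate",         "chain": ["primary on-call", "service owner", "engineering manager", "VP of engineering"]},
    "P1": {"response": "15 minutes",        "chain": ["primary on-call", "service owner", "engineering manager"]},
    "P2": {"response": "1 hour",            "chain": ["primary on-call", "service owner"]},
    "P3": {"response": "next business day", "chain": ["primary on-call"]},
}

def next_responder(severity: str, level: int) -> str:
    """Who to page at escalation level `level` (0 = the first responder)."""
    chain = ESCALATION_POLICY[severity]["chain"]
    return chain[min(level, len(chain) - 1)]
```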
Comprehensive Runbooks
Every alert should have:
Context Section:
- Alert meaning and impact
- Normal vs. abnormal values
- Recent changes that might affect it
- Business impact assessment
Diagnostic Steps:
- Initial verification steps
- Common causes and solutions
- Data collection requirements
- Escalation criteria
Resolution Steps:
- Step-by-step fix procedures
- Verification methods
- Rollback procedures
- Post-resolution tasks
Fair Rotation Management
A healthy rotation includes:
Schedule Structure:
- Maximum 7 days on primary
- Minimum 14 days between rotations
- At least 5 engineers per rotation
- Clear backup scheduling
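Those numbers hang together arithmetically: in a simple round-robin rotation, the gap between one engineer’s primary shifts is (team size minus one) times the shift length, so five engineers on 7-day shifts means 28 days off primary, and three engineers is the bare minimum that clears the 14-day floor. A tiny check, under that round-robin assumption:

```python
def days_between_primary_shifts(team_size: int, shift_days: int = 7) -> int:
    """Gap between one engineer's primary shifts in a simple round-robin rotation."""
    return (team_size - 1) * shift_days

assert days_between_primary_shifts(5) == 28  # the suggested team size gives a comfortable gap
assert days_between_primary_shifts(3) == 14  # the smallest team that still clears the floor
```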
Compensation:
- Base on-call stipend
- Per-incident compensation
- Time off after major incidents
- Tool and training budget
Support Systems:
- Mental health resources
- Flexible scheduling
- Training programs
- Regular rotation reviews
Measuring Success
How do you know if your on-call culture is healthy? Here are our key metrics:
Technical Metrics:
- Mean Time Between Failures (MTBF)
- Mean Time To Resolution (MTTR)
- Number of night pages
- False positive rate
- Repeat incident rate
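MTTR and MTBF are straightforward to compute once every incident is recorded with a start and a resolution time. The sketch below assumes you can export those as timestamp pairs from whichever incident tool you use.

```python
from datetime import datetime
from statistics import mean
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (started, resolved)

def mttr_hours(incidents: List[Incident]) -> float:
    """Mean Time To Resolution: average of (resolved - started)."""
    return mean((end - start).total_seconds() / 3600 for start, end in incidents)

def mtbf_hours(incidents: List[Incident]) -> float:
    """Mean Time Between Failures: average gap between consecutive incident starts."""
    starts = sorted(start for start, _ in incidents)
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(starts, starts[1:])]
    return mean(gaps) if gaps else float("inf")
```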
Human Metrics:
- Engineer satisfaction scores
- On-call stress levels
- Training completion rates
- Documentation quality scores
- Post-incident review completion
Business Metrics:
- Customer impact duration
- Revenue impact from incidents
- Time to customer notification
- Resolution SLA compliance
The Path Forward
Remember: The goal isn’t to handle every incident perfectly. It’s to learn from each one while maintaining your sanity and your sleep schedule.
Start with these steps:
- Audit your current state
- Identify the biggest pain points
- Implement quick wins first
- Build long-term improvement plans
- Regularly review and adjust
Your pager will go off again. But with these principles in mind, you’ll handle it with confidence and clarity, and maybe even get back to sleep afterward.
And if you’re currently in a toxic on-call situation? Sometimes the best thing you can do is find a team that values sustainable engineering practices. Your health matters more than any system.
Prioritize sleep over heroics. Your future self will thank you.