The Incident
It was 2 AM on a Saturday. My phone buzzed with PagerDuty alerts. Our production database CPU was at 100%. Customer-facing services were timing out. The team scrambled to respond.
The root cause? A single architectural decision made six months earlier.
This is the story of how we broke production, the scramble to fix it, and the lessons that transformed how we approach system design.
The Decision That Seemed Right at the Time
We were building a real-time analytics dashboard. The requirements:
- Show live metrics for 100K+ users
- Sub-second query response times
- Support complex aggregations
The obvious solution? PostgreSQL with lots of indexes.
We justified it:
- PostgreSQL is battle-tested
- We already used it elsewhere
- Indexes would make queries fast
- "We can optimize later if needed"
What We Missed
Write Amplification
Every user interaction required updating multiple tables:
-- User performs action
UPDATE user_stats SET actions = actions + 1 WHERE user_id = ?;
UPDATE daily_metrics SET count = count + 1 WHERE date = ?;
UPDATE feature_usage SET count = count + 1 WHERE feature = ?;
-- ... and 5 more similar updates
Each update triggered index maintenance. With 10K+ writes per second, the database spent more time maintaining indexes than serving queries.
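One common way to cut that amplification is to coalesce increments in memory and flush one UPDATE per key per interval instead of one per user action. A minimal sketch (class and key names are illustrative, not what we actually shipped):

```javascript
// Sketch: buffer counter increments and flush them in batches, so 10K
// actions/sec become a handful of UPDATEs per key per flush interval.
class CounterBuffer {
  constructor() {
    this.pending = new Map(); // key -> accumulated delta
  }

  increment(key, by = 1) {
    this.pending.set(key, (this.pending.get(key) || 0) + by);
  }

  // Returns [key, totalDelta] pairs and resets the buffer; the caller
  // issues one "UPDATE ... SET count = count + $delta" per pair.
  drain() {
    const batch = [...this.pending.entries()];
    this.pending.clear();
    return batch;
  }
}
```

The trade-off is a small window of un-flushed counts lost on crash, which is usually acceptable for analytics counters.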
Lock Contention
Hot rows in the metrics tables created lock contention. Multiple processes competed to update the same statistics:
Process 1: UPDATE daily_metrics WHERE date = '2024-03-10'
Process 2: UPDATE daily_metrics WHERE date = '2024-03-10' -- Blocked!
Process 3: UPDATE daily_metrics WHERE date = '2024-03-10' -- Blocked!
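A standard mitigation for hot rows is the split-counter pattern: shard one hot row into N rows so concurrent writers rarely collide, and sum the shards at read time. A sketch of the shard-key logic (table and key names are hypothetical):

```javascript
// Sketch: split one hot daily_metrics row into N shard rows.
// Writers update "their" shard row, e.g.:
//   UPDATE daily_metrics_shards SET count = count + 1 WHERE key = $1
// Readers sum the shards:
//   SELECT sum(count) FROM daily_metrics_shards WHERE key LIKE '2024-03-10:%'
const SHARDS = 8;

function shardRowKey(date, writerId) {
  // Deterministic spread across SHARDS rows; a random pick works too.
  return `${date}:${writerId % SHARDS}`;
}
```

With 8 shards, the worst-case number of writers queued on one row lock drops roughly eightfold, at the cost of a slightly more expensive read.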
Connection Pool Exhaustion
As queries slowed down, our connection pool filled up waiting for database responses. New connections couldn't be established, creating a cascading failure.
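One defense against this cascade is to fail fast instead of queueing without bound: once the pool is saturated, reject new work immediately rather than letting callers pile up. A minimal sketch of the idea (not the pool library we used):

```javascript
// Sketch: a fail-fast limiter in front of the database. When all slots
// are taken, callers get an immediate error they can handle (serve a
// cached response, return 503) instead of waiting until timeouts cascade.
class FailFastLimiter {
  constructor(maxActive) {
    this.maxActive = maxActive;
    this.active = 0;
  }

  tryAcquire() {
    if (this.active >= this.maxActive) return false; // shed load now
    this.active++;
    return true;
  }

  release() {
    this.active--;
  }
}
```

Most connection pools expose the same idea as configuration, e.g. a bounded wait queue or an acquire timeout.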
The Weekend From Hell
Saturday 2:00 AM - Initial Response
- Restarted database services
- Cleared slow query logs
- Thought we fixed it
Saturday 6:00 AM - It's Back
- Same symptoms reappeared
- Traffic from waking users increased load
- Database CPU pinned at 100%
Saturday 10:00 AM - The Bad News
- Read replicas had fallen far behind; replication lag kept growing
- Failover would cause data loss
- We had to keep the primary running and degraded
Saturday 2:00 PM - Mitigation
- Disabled real-time analytics features
- Cached everything aggressively
- Begged users to bear with us
Sunday - The Fix
Implemented a stopgap solution:
// Buffer writes in Redis (RPUSH so new events land at the tail
// while the flusher drains from the head)
await redis.rpush('analytics_events', JSON.stringify(event));
// Flush up to 1,000 buffered events every 5 seconds in one batch
setInterval(async () => {
  const events = await redis.lrange('analytics_events', 0, 999);
  if (events.length === 0) return;
  await db.query(
    'INSERT INTO analytics_events (data) SELECT jsonb_array_elements($1::jsonb)',
    [`[${events.join(',')}]`]
  );
  await redis.ltrim('analytics_events', events.length, -1);
}, 5000);
What We Should Have Done
The Right Architecture
From day one, we should have used an event-driven architecture:
[User Action] → [Event Stream] → [Kafka] → [Analytics Processor]
                                                     ↓
                                      [ClickHouse / TimescaleDB]
Benefits:
- Writes are append-only (no lock contention)
- Stream processing handles backpressure
- Time-series databases are optimized for this workload
- Database isolation (analytics can't break main app)
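The pipeline above can be sketched with an in-memory log standing in for Kafka. Writers only ever append; the analytics processor pulls batches at its own pace, so write bursts can't stall it. All names here are illustrative:

```javascript
// Sketch: append-only event stream + pull-based consumer.
class EventStream {
  constructor() {
    this.log = [];
    this.offset = 0; // consumer's position, like a Kafka offset
  }

  append(event) {
    this.log.push(event); // writers never contend on a shared row
  }

  poll(max = 100) {
    // Consumer pulls a bounded batch; backpressure is implicit because
    // it only takes what it can process.
    const batch = this.log.slice(this.offset, this.offset + max);
    this.offset += batch.length;
    return batch;
  }
}

// Analytics processor: fold events into aggregates on the read side.
function aggregate(batch, counts = {}) {
  for (const e of batch) counts[e.feature] = (counts[e.feature] || 0) + 1;
  return counts;
}
```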
Alternative Approaches
Option 1: Time-Series Database
- InfluxDB or TimescaleDB
- Purpose-built for metrics
- Better compression and query performance
Option 2: Event Sourcing
- Store events, not state
- Rebuild analytics views on demand
- Natural fit for real-time systems
Option 3: CQRS Pattern
- Separate read and write models
- Optimize each for its use case
- Asynchronous eventual consistency
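Option 2's key property, rebuilding views on demand, is easy to see in miniature. The view is derived state: throw it away and replay the immutable event log to reconstruct it. Event shapes below are hypothetical:

```javascript
// Sketch: event sourcing. Events are stored as facts; the analytics
// view is a pure function of the log and can be rebuilt at any time.
const events = [
  { type: 'action', userId: 1, feature: 'search' },
  { type: 'action', userId: 2, feature: 'search' },
  { type: 'action', userId: 1, feature: 'export' },
];

function rebuildFeatureUsage(log) {
  const view = {};
  for (const e of log) {
    if (e.type === 'action') view[e.feature] = (view[e.feature] || 0) + 1;
  }
  return view;
}
```

If a projection has a bug, you fix the fold function and replay; there's no corrupted mutable state to repair.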
The Recovery Plan
After that weekend, we didn't just patch the system. We rebuilt it:
- Immediate: Event buffering with Redis (stopgap)
- Week 1: Added Kafka to the architecture
- Week 2: Migrated analytics to TimescaleDB
- Week 3: Removed the old metrics tables from PostgreSQL
- Week 4: Added monitoring and alerts for write amplification
Lessons Learned
1. Read Patterns ≠ Write Patterns
A database optimized for fast reads (with lots of indexes) is often terrible for high-throughput writes. Never optimize for one at the expense of the other.
2. Scale Writes, Don't Optimize Them
Instead of trying to make a single database handle all writes, design your system to distribute them:
- Event streams
- Sharded databases
- Time-series data stores
- Append-only logs
3. Test at Scale
Our load tests used 100 users, not 100K. The architecture worked fine at small scale but collapsed at production volumes. Test realistically or pay the price.
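A write-load test doesn't need heavy tooling to be realistic about concurrency; the point is to fire many writers at once, not one after another. A minimal sketch, where `writeEvent` is a stand-in for whatever your real write path is:

```javascript
// Sketch: spawn `users` concurrent writers, each issuing `perUser`
// sequential writes, and report aggregate throughput.
async function loadTest(writeEvent, { users = 1000, perUser = 10 } = {}) {
  const start = Date.now();
  const tasks = [];
  for (let u = 0; u < users; u++) {
    tasks.push(
      (async () => {
        for (let i = 0; i < perUser; i++) {
          await writeEvent({ userId: u, seq: i });
        }
      })()
    );
  }
  await Promise.all(tasks);
  const seconds = Math.max((Date.now() - start) / 1000, 0.001);
  return { writes: users * perUser, writesPerSecond: (users * perUser) / seconds };
}
```

Run it at production-like `users` counts against a staging database; if throughput collapses there, it will collapse in production too.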
4. Architectural Decisions Are Hard to Change
The decision to use PostgreSQL for analytics was easy to make but incredibly difficult to undo. Consider the exit strategy before committing.
5. Every Service Has Its Specialty
PostgreSQL is great at many things, but real-time high-write analytics isn't one of them. Use the right tool for the job.
A Framework for Making Architectural Decisions
Now, when faced with architecture decisions, I ask:
- What's the primary use case? (Read vs. Write)
- What's the expected scale? (Users, QPS, data volume)
- What's the fallback plan? (If this doesn't work, then what?)
- What are the failure modes? (Single point of failure? Cascading failures?)
- Can we test this at scale? (Realistic load testing)
- What's the exit strategy? (How do we migrate away if needed?)
The Silver Lining
That weekend was miserable, but it changed how our entire team thinks about architecture. We're now:
- More skeptical of "obvious" solutions
- Better at designing for failure
- More proactive about testing
- Comfortable with event-driven patterns
Final Thoughts
Architecture matters because these decisions compound. A small mistake doesn't just cause a bug; it shapes everything you build afterwards.
The weekend we spent fixing production could have been avoided with better upfront thinking. But as the saying goes: Good decisions come from experience. Experience comes from bad decisions.
At least we got a great story out of it.