
Why Architecture Matters: Lessons from a Production Outage

A deep dive into how a single architectural decision caused a weekend-long outage and what we learned.


The Incident

It was 2 AM on a Saturday. My phone buzzed with PagerDuty alerts. Our production database CPU was at 100%. Customer-facing services were timing out. The team scrambled to respond.

The root cause? A single architectural decision made six months earlier.

This is the story of how we broke production, the scramble to fix it, and the lessons that transformed how we approach system design.

The Decision That Seemed Right at the Time

We were building a real-time analytics dashboard. The requirements:

  • Show live metrics for 100K+ users
  • Sub-second query response times
  • Support complex aggregations

The obvious solution? PostgreSQL with lots of indexes.

We justified it:

  • PostgreSQL is battle-tested
  • We already used it elsewhere
  • Indexes would make queries fast
  • "We can optimize later if needed"

What We Missed

Write Amplification

Every user interaction required updating multiple tables:

-- User performs action
UPDATE user_stats SET actions = actions + 1 WHERE user_id = ?;
UPDATE daily_metrics SET count = count + 1 WHERE date = ?;
UPDATE feature_usage SET count = count + 1 WHERE feature = ?;
-- ... and 5 more similar updates

Each update triggered index maintenance. With 10K+ writes per second, the database spent more time maintaining indexes than serving queries.
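A back-of-envelope calculation makes the problem concrete. The numbers below are illustrative (not our exact schema), but the shape of the math is the point: every logical write fans out into one heap write per table plus one write per index on that table.

```javascript
// Illustrative write amplification estimate (numbers are assumptions,
// not our real schema): each user action updates 8 tables, and each
// table carries ~6 indexes that must also be maintained.
const userActionsPerSecond = 10_000;
const tablesPerAction = 8;   // the UPDATEs shown above, plus "5 more"
const indexesPerTable = 6;   // "lots of indexes"

const heapWrites = userActionsPerSecond * tablesPerAction;
const indexWrites = heapWrites * indexesPerTable;
const totalPhysicalWrites = heapWrites + indexWrites;

// Amplification factor: physical writes per logical user action.
const amplification = totalPhysicalWrites / userActionsPerSecond;
console.log(`logical: ${userActionsPerSecond}/s, physical: ${totalPhysicalWrites}/s`);
console.log(`amplification: ${amplification}x`); // 56x
```

At a 56x amplification factor, 10K user actions per second becomes over half a million physical writes per second, which is why the database spent its CPU on index maintenance.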

Lock Contention

Hot rows in the metrics tables created lock contention. Multiple processes competed to update the same statistics:

Process 1: UPDATE daily_metrics WHERE date = '2024-03-10'
Process 2: UPDATE daily_metrics WHERE date = '2024-03-10'  -- Blocked!
Process 3: UPDATE daily_metrics WHERE date = '2024-03-10'  -- Blocked!
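A standard mitigation for hot rows (not something we knew to reach for at the time) is to split each counter across N shard rows, so concurrent writers rarely collide, and have reads sum the shards. A minimal in-memory sketch of the idea:

```javascript
// Sharded counter sketch: instead of one hot row per date, keep
// SHARDS rows and pick one pseudo-randomly per write. Concurrent
// writers now contend on 1/SHARDS of the rows on average.
const SHARDS = 8;
const shards = new Map(); // key: `${date}:${shard}` -> count

function increment(date) {
  const shard = Math.floor(Math.random() * SHARDS);
  const key = `${date}:${shard}`;
  shards.set(key, (shards.get(key) ?? 0) + 1);
}

// Reads pay a small cost: sum all shards for the date.
function readCount(date) {
  let total = 0;
  for (let s = 0; s < SHARDS; s++) total += shards.get(`${date}:${s}`) ?? 0;
  return total;
}

for (let i = 0; i < 1000; i++) increment('2024-03-10');
console.log(readCount('2024-03-10')); // 1000
```

The same pattern maps directly onto SQL: N physical rows per logical counter, with a `SUM` at read time.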

Connection Pool Exhaustion

As queries slowed down, our connection pool filled up waiting for database responses. New connections couldn't be established, creating a cascading failure.
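The failure mode is easy to reproduce with a toy pool: once every slot is held by a slow query, new callers queue up and eventually time out, and those timeouts surface as errors far from the real cause. This sketch is a simplified stand-in for a real pool library:

```javascript
// Toy connection pool: acquire() resolves when a slot is free, or
// rejects after a timeout -- roughly what our callers experienced
// once every slot was stuck behind a slow query.
class Pool {
  constructor(size) { this.free = size; this.waiters = []; }
  acquire(timeoutMs) {
    if (this.free > 0) { this.free--; return Promise.resolve(); }
    return new Promise((resolve, reject) => {
      const t = setTimeout(() => reject(new Error('pool timeout')), timeoutMs);
      this.waiters.push(() => { clearTimeout(t); resolve(); });
    });
  }
  release() {
    const next = this.waiters.shift();
    if (next) next(); else this.free++; // hand the slot straight to a waiter
  }
}

async function demo() {
  const pool = new Pool(2);
  await pool.acquire(10);
  await pool.acquire(10); // pool now exhausted
  try {
    await pool.acquire(50); // nothing releases -> times out
    return 'acquired';
  } catch {
    return 'timed out';
  }
}
demo().then(console.log); // "timed out"
```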

The Weekend From Hell

Saturday 2:00 AM - Initial Response

  • Restarted database services
  • Cleared slow query logs
  • Thought we fixed it

Saturday 6:00 AM - It's Back

  • Same symptoms reappeared
  • Traffic from waking users increased load
  • Database CPU pinned at 100%

Saturday 10:00 AM - The Bad News

  • Read replicas had fallen far behind; replication lag kept growing
  • Failover would cause data loss
  • We had to keep the primary running and degraded

Saturday 2:00 PM - Mitigation

  • Disabled real-time analytics features
  • Cached everything aggressively
  • Begged users to bear with us

Sunday - The Fix

Implemented a stopgap solution:

// Buffer writes in Redis (append to the tail so order is preserved)
await redis.rpush('analytics_events', JSON.stringify(event));

// Flush up to 1,000 buffered events every 5 seconds
setInterval(async () => {
  const events = await redis.lrange('analytics_events', 0, 999);
  if (events.length === 0) return;
  // One multi-row insert instead of a thousand single-row writes
  await db.query(
    'INSERT INTO analytics_events (data) SELECT jsonb_array_elements($1::jsonb)',
    [JSON.stringify(events.map(JSON.parse))]
  );
  // Trim exactly what we read, in case more events arrived meanwhile
  await redis.ltrim('analytics_events', events.length, -1);
}, 5000);

What We Should Have Done

The Right Architecture

From day one, we should have used an event-driven architecture:

[User Action] → [Event Stream] → [Kafka] → [Analytics Processor]
                                                    ↓
                                            [ClickHouse / TimescaleDB]

Benefits:

  • Writes are append-only (no lock contention)
  • Stream processing handles backpressure
  • Time-series databases are optimized for this workload
  • Database isolation (analytics can't break main app)
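The backpressure benefit is worth dwelling on, because it is exactly what the PostgreSQL design lacked. Here is an in-memory sketch of a bounded stream where the producer waits when the buffer is full instead of overwhelming the sink; it is a stand-in for what a Kafka topic plus a consumer group provides, not a real client:

```javascript
// Bounded append-only stream with backpressure: append() awaits
// when the buffer is full, so a slow consumer throttles producers
// instead of letting writes pile up and melt the database.
class BoundedStream {
  constructor(capacity) {
    this.capacity = capacity;
    this.buffer = [];
    this.pending = []; // producers waiting for room
  }
  async append(event) {
    while (this.buffer.length >= this.capacity) {
      await new Promise(res => this.pending.push(res)); // backpressure
    }
    this.buffer.push(event);
  }
  drain(batchSize) {
    const batch = this.buffer.splice(0, batchSize);
    this.pending.splice(0).forEach(res => res()); // wake waiting producers
    return batch;
  }
}

async function demo() {
  const stream = new BoundedStream(3);
  const consumed = [];
  // Slow consumer: drain a small batch every 5 ms.
  const consumer = setInterval(() => consumed.push(...stream.drain(2)), 5);
  for (let i = 0; i < 10; i++) await stream.append(i); // blocks when full
  while (stream.buffer.length) await new Promise(r => setTimeout(r, 5));
  clearInterval(consumer);
  return consumed;
}
demo().then(events => console.log(events.length)); // 10
```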

Alternative Approaches

Option 1: Time-Series Database

  • InfluxDB or TimescaleDB
  • Purpose-built for metrics
  • Better compression and query performance

Option 2: Event Sourcing

  • Store events, not state
  • Rebuild analytics views on demand
  • Natural fit for real-time systems
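Event sourcing in miniature looks like this: store immutable events and derive any analytics view by folding over the log, rather than mutating a hot counters table on every write. The event shape here is made up for illustration:

```javascript
// Event sourcing sketch: the log is the source of truth;
// views are derived, disposable, and rebuildable at any time.
const events = [
  { type: 'action', userId: 'u1', feature: 'export', at: '2024-03-10' },
  { type: 'action', userId: 'u2', feature: 'search', at: '2024-03-10' },
  { type: 'action', userId: 'u1', feature: 'search', at: '2024-03-11' },
];

// Rebuild a "daily_metrics" view on demand instead of
// UPDATE-ing a shared row on every user action.
function dailyMetrics(log) {
  return log.reduce((view, e) => {
    view[e.at] = (view[e.at] ?? 0) + 1;
    return view;
  }, {});
}

console.log(dailyMetrics(events)); // { '2024-03-10': 2, '2024-03-11': 1 }
```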

Option 3: CQRS Pattern

  • Separate read and write models
  • Optimize each for its use case
  • Asynchronous eventual consistency
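The CQRS split can be sketched in a few lines: the write path only appends events, and a projector folds them into a denormalized read model out-of-band. Names and shapes here are illustrative:

```javascript
// CQRS sketch: commands append to the write model; a projector
// asynchronously updates the read model that queries hit.
const eventLog = [];                   // write model: append-only
const readModel = { totalActions: 0 }; // read model: denormalized

function recordAction(userId) {        // command handler (write path)
  eventLog.push({ type: 'action', userId, at: Date.now() });
}

let cursor = 0;
function project() {                   // runs out-of-band, e.g. on a timer
  while (cursor < eventLog.length) {
    const e = eventLog[cursor++];
    if (e.type === 'action') readModel.totalActions++;
  }
}

recordAction('u1');
recordAction('u2');
console.log(readModel.totalActions); // 0 -- eventually consistent
project();
console.log(readModel.totalActions); // 2
```

The read model lags the write model by one projection interval; that lag is the "asynchronous eventual consistency" the bullet above refers to.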

The Recovery Plan

After that weekend, we didn't just patch the system. We rebuilt it:

  1. Immediate: Event buffering with Redis (stopgap)
  2. Week 1: Added Kafka to the architecture
  3. Week 2: Migrated analytics to TimescaleDB
  4. Week 3: Removed the old metrics tables from PostgreSQL
  5. Week 4: Added monitoring and alerts for write amplification

Lessons Learned

1. Read Patterns ≠ Write Patterns

A database optimized for fast reads (with lots of indexes) is often terrible for high-throughput writes. Never optimize for one at the expense of the other.

2. Scale Writes, Don't Optimize Them

Instead of trying to make a single database handle all writes, design your system to distribute them:

  • Event streams
  • Sharded databases
  • Time-series data stores
  • Append-only logs

3. Test at Scale

Our load tests used 100 users, not 100K. The architecture worked fine at small scale but collapsed at production volumes. Test realistically or pay the price.

4. Architectural Decisions Are Hard to Change

The decision to use PostgreSQL for analytics was easy to make but incredibly difficult to undo. Consider the exit strategy before committing.

5. Every Service Has Its Specialty

PostgreSQL is great at many things, but real-time high-write analytics isn't one of them. Use the right tool for the job.

A Framework for Making Architectural Decisions

Now, when faced with architecture decisions, I ask:

  1. What's the primary use case? (Read vs. Write)
  2. What's the expected scale? (Users, QPS, data volume)
  3. What's the fallback plan? (If this doesn't work, then what?)
  4. What are the failure modes? (Single point of failure? Cascading failures?)
  5. Can we test this at scale? (Realistic load testing)
  6. What's the exit strategy? (How do we migrate away if needed?)

The Silver Lining

That weekend was miserable, but it changed how our entire team thinks about architecture. We're now:

  • More skeptical of "obvious" solutions
  • Better at designing for failure
  • More proactive about testing
  • Comfortable with event-driven patterns

Final Thoughts

Architecture matters because these decisions compound. A small mistake doesn't just cause a bug; it shapes everything you build afterwards.

The weekend we spent fixing production could have been avoided with better upfront thinking. But as the saying goes: Good decisions come from experience. Experience comes from bad decisions.

At least we got a great story out of it.

Tags

System Design · PostgreSQL · Next.js · Kubernetes
