The Incident
It was 2 AM on a Saturday. My phone buzzed with PagerDuty alerts. Our production database CPU was at 100%. Customer-facing services were timing out. The team scrambled to respond.
The root cause? A single architectural decision made six months earlier.
This is the story of how we broke production, the scramble to fix it, and the lessons that transformed how we approach system design.
The Decision That Seemed Right at the Time
We were building a real-time analytics dashboard. The requirements:
- Show live metrics for 100K+ users
- Sub-second query response times
- Support complex aggregations
The obvious solution? PostgreSQL with lots of indexes.
We justified it:
- PostgreSQL is battle-tested
- We already used it elsewhere
- Indexes would make queries fast
- "We can optimize later if needed"
What We Missed
Write Amplification
Every user interaction required updating multiple tables:
-- User performs action
UPDATE user_stats SET actions = actions + 1 WHERE user_id = ?;
UPDATE daily_metrics SET count = count + 1 WHERE date = ?;
UPDATE feature_usage SET count = count + 1 WHERE feature = ?;
-- ... and 5 more similar updates
Each update triggered index maintenance. With 10K+ writes per second, the database spent more time maintaining indexes than serving queries.
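One common way to cut that amplification is to coalesce increments in memory and flush one UPDATE per key per interval instead of one per user action. A minimal sketch (class and key names are illustrative, not what we actually shipped):

```javascript
// Sketch: buffer counter increments and flush them in batches, so 10K
// actions/sec become a handful of UPDATEs per key per flush interval.
class CounterBuffer {
  constructor() {
    this.pending = new Map(); // key -> accumulated delta
  }

  increment(key, by = 1) {
    this.pending.set(key, (this.pending.get(key) || 0) + by);
  }

  // Returns [key, totalDelta] pairs and resets the buffer; the caller
  // issues one "UPDATE ... SET count = count + $delta" per pair.
  drain() {
    const batch = [...this.pending.entries()];
    this.pending.clear();
    return batch;
  }
}
```

The trade-off is a small window of un-flushed counts lost on crash, which is usually acceptable for analytics counters.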
Lock Contention
Hot rows in the metrics tables created lock contention. Multiple processes competed to update the same statistics:
Process 1: UPDATE daily_metrics WHERE date = '2024-03-10'
Process 2: UPDATE daily_metrics WHERE date = '2024-03-10' -- Blocked!
Process 3: UPDATE daily_metrics WHERE date = '2024-03-10' -- Blocked!
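A standard mitigation for hot rows is the split-counter pattern: shard one hot row into N rows so concurrent writers rarely collide, and sum the shards at read time. A sketch of the shard-key logic (table and key names are hypothetical):

```javascript
// Sketch: split one hot daily_metrics row into N shard rows.
// Writers update "their" shard row, e.g.:
//   UPDATE daily_metrics_shards SET count = count + 1 WHERE key = $1
// Readers sum the shards:
//   SELECT sum(count) FROM daily_metrics_shards WHERE key LIKE '2024-03-10:%'
const SHARDS = 8;

function shardRowKey(date, writerId) {
  // Deterministic spread across SHARDS rows; a random pick works too.
  return `${date}:${writerId % SHARDS}`;
}
```

With 8 shards, the worst-case number of writers queued on one row lock drops roughly eightfold, at the cost of a slightly more expensive read.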
Connection Pool Exhaustion
As queries slowed down, our connection pool filled up waiting for database responses. New connections couldn't be established, creating a cascading failure.
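One defense against this cascade is to fail fast instead of queueing without bound: once the pool is saturated, reject new work immediately rather than letting callers pile up. A minimal sketch of the idea (not the pool library we used):

```javascript
// Sketch: a fail-fast limiter in front of the database. When all slots
// are taken, callers get an immediate error they can handle (serve a
// cached response, return 503) instead of waiting until timeouts cascade.
class FailFastLimiter {
  constructor(maxActive) {
    this.maxActive = maxActive;
    this.active = 0;
  }

  tryAcquire() {
    if (this.active >= this.maxActive) return false; // shed load now
    this.active++;
    return true;
  }

  release() {
    this.active--;
  }
}
```

Most connection pools expose the same idea as configuration, e.g. a bounded wait queue or an acquire timeout.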
The Weekend From Hell
Saturday 2:00 AM - Initial Response
- Restarted database services
- Cleared slow query logs
- Thought we fixed it
Saturday 6:00 AM - It's Back
- Same symptoms reappeared
- Traffic from waking users increased load
- Database CPU pinned at 100%
Saturday 10:00 AM - The Bad News
- Read replicas had fallen far behind; replication lag kept growing
- Failover would cause data loss
- We had to keep the primary running and degraded
Saturday 2:00 PM - Mitigation
- Disabled real-time analytics features
- Cached everything aggressively
- Begged users to bear with us
Sunday - The Fix
Implemented a stopgap solution:
// Buffer writes in Redis (RPUSH so new events land at the tail
// while the flusher drains from the head)
await redis.rpush('analytics_events', JSON.stringify(event));
// Flush up to 1,000 buffered events every 5 seconds in one batch
setInterval(async () => {
  const events = await redis.lrange('analytics_events', 0, 999);
  if (events.length === 0) return;
  await db.query(
    'INSERT INTO analytics_events (data) SELECT jsonb_array_elements($1::jsonb)',
    [`[${events.join(',')}]`]
  );
  await redis.ltrim('analytics_events', events.length, -1);
}, 5000);
What We Should Have Done
The Right Architecture
From day one, we should have used an event-driven architecture:
[User Action] → [Event Stream] → [Kafka] → [Analytics Processor]
                                                     ↓
                                      [ClickHouse / TimescaleDB]
Benefits:
- Writes are append-only (no lock contention)
- Stream processing handles backpressure
- Time-series databases are optimized for this workload
- Database isolation (analytics can't break main app)
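The pipeline above can be sketched with an in-memory log standing in for Kafka. Writers only ever append; the analytics processor pulls batches at its own pace, so write bursts can't stall it. All names here are illustrative:

```javascript
// Sketch: append-only event stream + pull-based consumer.
class EventStream {
  constructor() {
    this.log = [];
    this.offset = 0; // consumer's position, like a Kafka offset
  }

  append(event) {
    this.log.push(event); // writers never contend on a shared row
  }

  poll(max = 100) {
    // Consumer pulls a bounded batch; backpressure is implicit because
    // it only takes what it can process.
    const batch = this.log.slice(this.offset, this.offset + max);
    this.offset += batch.length;
    return batch;
  }
}

// Analytics processor: fold events into aggregates on the read side.
function aggregate(batch, counts = {}) {
  for (const e of batch) counts[e.feature] = (counts[e.feature] || 0) + 1;
  return counts;
}
```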
Alternative Approaches
Option 1: Time-Series Database
- InfluxDB or TimescaleDB
- Purpose-built for metrics
- Better compression and query performance
Option 2: Event Sourcing
- Store events, not state
- Rebuild analytics views on demand
- Natural fit for real-time systems
Option 3: CQRS Pattern
- Separate read and write models
- Optimize each for its use case
- Asynchronous eventual consistency
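Option 2's key property, rebuilding views on demand, is easy to see in miniature. The view is derived state: throw it away and replay the immutable event log to reconstruct it. Event shapes below are hypothetical:

```javascript
// Sketch: event sourcing. Events are stored as facts; the analytics
// view is a pure function of the log and can be rebuilt at any time.
const events = [
  { type: 'action', userId: 1, feature: 'search' },
  { type: 'action', userId: 2, feature: 'search' },
  { type: 'action', userId: 1, feature: 'export' },
];

function rebuildFeatureUsage(log) {
  const view = {};
  for (const e of log) {
    if (e.type === 'action') view[e.feature] = (view[e.feature] || 0) + 1;
  }
  return view;
}
```

If a projection has a bug, you fix the fold function and replay; there's no corrupted mutable state to repair.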
The Recovery Plan
After that weekend, we didn't just patch the system. We rebuilt it:
- Immediate: Event buffering with Redis (stopgap)
- Week 1: Added Kafka to the architecture
- Week 2: Migrated analytics to TimescaleDB
- Week 3: Removed the old metrics tables from PostgreSQL
- Week 4: Added monitoring and alerts for write amplification
Lessons Learned
1. Read Patterns ≠ Write Patterns
A database optimized for fast reads (with lots of indexes) is often terrible for high-throughput writes. Never optimize for one at the expense of the other.
2. Scale Writes, Don't Optimize Them
Instead of trying to make a single database handle all writes, design your system to distribute them:
- Event streams
- Sharded databases
- Time-series data stores
- Append-only logs
3. Test at Scale
Our load tests used 100 users, not 100K. The architecture worked fine at small scale but collapsed at production volumes. Test realistically or pay the price.
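A write-load test doesn't need heavy tooling to be realistic about concurrency; the point is to fire many writers at once, not one after another. A minimal sketch, where `writeEvent` is a stand-in for whatever your real write path is:

```javascript
// Sketch: spawn `users` concurrent writers, each issuing `perUser`
// sequential writes, and report aggregate throughput.
async function loadTest(writeEvent, { users = 1000, perUser = 10 } = {}) {
  const start = Date.now();
  const tasks = [];
  for (let u = 0; u < users; u++) {
    tasks.push(
      (async () => {
        for (let i = 0; i < perUser; i++) {
          await writeEvent({ userId: u, seq: i });
        }
      })()
    );
  }
  await Promise.all(tasks);
  const seconds = Math.max((Date.now() - start) / 1000, 0.001);
  return { writes: users * perUser, writesPerSecond: (users * perUser) / seconds };
}
```

Run it at production-like `users` counts against a staging database; if throughput collapses there, it will collapse in production too.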
4. Architectural Decisions Are Hard to Change
The decision to use PostgreSQL for analytics was easy to make but incredibly difficult to undo. Consider the exit strategy before committing.
5. Every Service Has Its Specialty
PostgreSQL is great at many things, but real-time high-write analytics isn't one of them. Use the right tool for the job.
A Framework for Making Architectural Decisions
Now, when faced with architecture decisions, I ask:
- What's the primary use case? (Read vs. Write)
- What's the expected scale? (Users, QPS, data volume)
- What's the fallback plan? (If this doesn't work, then what?)
- What are the failure modes? (Single point of failure? Cascading failures?)
- Can we test this at scale? (Realistic load testing)
- What's the exit strategy? (How do we migrate away if needed?)
The Silver Lining
That weekend was miserable, but it changed how our entire team thinks about architecture. We're now:
- More skeptical of "obvious" solutions
- Better at designing for failure
- More proactive about testing
- Comfortable with event-driven patterns
Final Thoughts
Architecture matters because these decisions compound. A small mistake doesn't just cause a bug; it shapes everything you build afterwards.
The weekend we spent fixing production could have been avoided with better upfront thinking. But as the saying goes: Good decisions come from experience. Experience comes from bad decisions.
At least we got a great story out of it.