
That Time Your Pipeline Ran Successfully and Deleted 75% of Your Data

Your DAG completed. No errors. Success metrics green. Then your dashboard showed 75% fewer records than yesterday. Here's what happened — and why it kept happening.

4 min read
Airflow · Data Warehousing · Orchestration

The Dashboard Incident

9 AM. You pull up the analytics dashboard.

Yesterday: 1.1 million records. Today: 271,000.

You check the pipeline logs. All DAGs completed. No errors. Success metrics green. Upstream jobs show success.

So where did 75% of your data go?

You dig into the records. 818,902 out of 1,090,125 rows flagged as "deleted." But you know that's wrong — the business hasn't changed. The data source hasn't changed.

Your pipeline succeeded. Your data is corrupt.

Same outcome, different cause: an external SaaS vendor makes an unannounced API change. No migration guide, no deprecation notice — the contract just changes. Your pipeline starts failing silently. Not failing as in "error in the logs" — failing as in "the API returns 200 OK with empty response payloads." For two or three days, the pipeline runs successfully, loads empty data, and overwrites production tables. You only find out when users complain that dashboards look wrong. The vendor changed their API. Your code didn't break. The data did.


This isn't a hypothetical. I've seen both failure modes four times in two months.

One pattern: orchestration platform migration causes intermittent timeouts. Upstream DAG fails partway through. Downstream DAG has no dependency gate. It runs anyway, sees incomplete data, and overwrites production with partial records.

Another pattern: external dependency changes contract. API returns 200 OK with empty payloads. Pipeline succeeds. Data is garbage.
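One defense against that second pattern is a payload check between extract and load: treat a 200 OK with no rows as a failure, not a success. A minimal sketch (the function name and record shape are illustrative, not from the incident):

```python
def assert_nonempty_payload(status_code: int, records: list) -> list:
    """Fail loudly when an API call 'succeeds' but returns no data.

    A 200 OK with an empty body is indistinguishable from success
    unless the pipeline inspects the payload itself.
    """
    if status_code == 200 and not records:
        raise ValueError(
            "API returned 200 OK with an empty payload; "
            "refusing to overwrite production with nothing"
        )
    return records
```

Wired between the extract step and the load step, this turns a silent overwrite into a loud pipeline failure.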

The logs show success. The data shows catastrophic loss.

The Uncomfortable Truth

Most data platforms assume that if a pipeline completes, the data is good.

That assumption is wrong.

A successful DAG run tells you nothing about data quality. It tells you the code executed. It doesn't tell you whether the data it processed was complete, valid, or sane.

When you design pipelines around execution success rather than data validation, you build systems that silently corrupt themselves.

What Kept Happening

Here's the technical breakdown of the incidents I saw:

Orchestration platform migration — the team moved from on-prem Airflow to a cloud-managed orchestration platform. New environment, different network patterns, connection timeouts that didn't exist before.

Upstream DAG failures — the source extraction DAG started timing out intermittently. Sometimes it loaded 100% of the data. Sometimes 40%. Sometimes 70%. The failures were silent — no exception thrown, just incomplete data written to staging.

No dependency gate — the downstream transform DAG had no data-quality check. It simply read from staging and loaded to production. The assumption: if staging exists, it's ready.

Destructive overwrite semantics — the load used replace=True. Truncate table, load new data, commit. If the new data was incomplete, too bad. The good data from yesterday was gone.

Customer-reported detection — how did we find out? Customers noticed. The monitoring dashboards didn't catch it. The data-quality checks didn't catch it. Users saw missing records and opened support tickets.

Here's what that failure architecture looks like:

The DAG runs. The logs show success. The data is wrong.
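In code, that failure architecture is roughly this anti-pattern (a dict stands in for the production table; all names are hypothetical):

```python
def load_to_production(staging_rows: list, production: dict) -> None:
    """The failure architecture: trust staging, truncate, replace.

    There is no check that staging_rows is complete -- if the upstream
    extract timed out at 40%, production now holds 40% of the data,
    and yesterday's good rows are gone.
    """
    production.clear()               # truncate table
    for row in staging_rows:         # load whatever staging happens to have
        production[row["id"]] = row
    # commit -- no rollback path to yesterday's data
```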

What Was Eventually Done

The team implemented a threshold check: if more than 20-25% of records are flagged as deleted, fail the pipeline and alert.
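That check might look something like this sketch (field names and the exact threshold are illustrative):

```python
def check_deletion_rate(rows: list, max_deleted_fraction: float = 0.25) -> None:
    """Fail the pipeline when deletions exceed a sanity threshold.

    If more than ~25% of records arrive flagged as deleted, something
    upstream is wrong -- halt and alert rather than overwrite production.
    """
    if not rows:
        raise ValueError("empty batch: refusing to load")
    deleted = sum(1 for r in rows if r.get("deleted"))
    fraction = deleted / len(rows)
    if fraction > max_deleted_fraction:
        raise ValueError(
            f"{fraction:.0%} of records flagged deleted "
            f"(threshold {max_deleted_fraction:.0%}); failing pipeline"
        )
```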

That stopped the catastrophic data loss. But it's a band-aid.

The real fix requires rethinking how pipelines validate data.

The core issue: your orchestrator tells you the code ran, not whether the data is good.

Before/After: The Fix

What You Have Now

Implicit trust. The downstream DAG assumes that if staging exists, it's ready. Pipeline success ≠ data quality.

What You Need

Explicit validation contracts. Every stage proves its output is complete and sane before the next stage runs.
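One way to express such a contract is a gate that compares today's staging volume against yesterday's load before anything destructive happens. A sketch, with an illustrative tolerance:

```python
def validate_before_load(staging_count: int, yesterday_count: int,
                         tolerance: float = 0.5) -> None:
    """Dependency gate: sanity-check today's volume against yesterday's.

    If the new batch is smaller than `tolerance` of the previous load,
    assume the upstream extract was incomplete and refuse to proceed.
    The 0.5 tolerance is an assumption -- tune it per dataset.
    """
    if yesterday_count and staging_count < tolerance * yesterday_count:
        raise ValueError(
            f"staging has {staging_count} rows vs {yesterday_count} "
            "yesterday; upstream extract looks incomplete -- blocking load"
        )
```

Run as the first task of the downstream DAG, this gate would have stopped the 1.1M-to-271K overwrite before it reached production.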

The Hard Truth

Your pipeline isn't healthy just because it completes.

A successful DAG run tells you nothing about whether your data is complete, valid, or sane. If you design pipelines around execution success rather than data validation, you will build systems that silently corrupt themselves.

The fix isn't better monitoring. It's validation contracts between every stage.

Record counts. Completeness checks. Range validations. Statistical anomaly detection.
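The last of those can be as simple as a z-score on daily record counts. A standard-library sketch (the 3-sigma cutoff is an assumption; real pipelines may want seasonality-aware checks):

```python
from statistics import mean, stdev

def is_count_anomalous(history: list, today: int, k: float = 3.0) -> bool:
    """Flag today's record count if it falls outside mean +/- k*stdev
    of recent daily counts."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma
```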

Your DAG completes? Great. Now prove the data is good.


P.S. The 20-25% deletion threshold check stopped the catastrophic failures. But the team's still planning a pipeline health monitoring dashboard and exploring an AI-powered DAG failure resolution agent. The real work is designing systems that validate data at every stage, not just when things break.

Tags

Airflow · Data Warehousing · Orchestration

