Machine Learning

Your ML Model Passed All Tests. Then It Failed in Production.

Model evaluation: 94% accuracy. Production: wrong predictions everywhere. Your model is fine. Your features are lying to you.

5 min read
ML · Feature Stores · Data Warehousing

The Model That Worked Too Well

You have two teams, two codebases, one model.

They're supposed to agree. They never do.

Data science builds features in batch SQL. Engineering re-implements them in production code. The model trains on one definition of truth and serves on another.

94% accuracy on the holdout set. You deploy to production. Two weeks later, the predictions are catastrophically wrong.


I've seen this exact failure mode across multiple organizations. Their ML models performed perfectly in evaluation but produced unexpected results in production.

The evaluation worked because the training features were computed correctly. Production broke because the serving features were computed differently.

The problem wasn't their model. It was their feature architecture.

Your Model Is Fine. Your Features Are Lying

Here's the uncomfortable truth: most ML production failures have nothing to do with model quality. They're feature transformation failures in disguise.

And the root cause isn't just technical — it's organizational.

Data science and data engineering operate as separate fiefdoms. Each has its own roadmap, its own tech stack, its own definition of "truth." Features get implemented twice in two different codebases, and everyone assumes the implementations match. They never do.

The symptoms look like model problems:

  • Wrong predictions
  • Performance degradation
  • Unexpected outputs
  • Compliance failures

But the root cause is the dual implementation itself.

The Silent Killer: Dual Feature Implementations

Your features follow a journey that looks like this:

Feature transformations are implemented twice — once in batch SQL for training, and again in production code for serving.

Data science builds features in the data warehouse. Engineering re-implements them for serving.

They're supposed to be the same. They never are.

What Kept Happening: Technical Breakdown

I spoke with a customer who described this exact problem. Their ML models showed strong evaluation metrics but failed in production.

The issue? Data science and engineering operated separately.

Feature transformations were implemented in two different codebases:

  • Training: Batch SQL in the data warehouse
  • Serving: Production code in a different language

Each implementation made different assumptions:

  • Null handling: SQL COALESCE vs application logic
  • Windowing: Batch OVER (PARTITION BY ...) vs streaming aggregation
  • Join semantics: Batch inner joins vs real-time lookups
  • Data freshness: Nightly batch jobs vs continuous updates
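The null-handling row is the classic one. Here's a toy Python sketch (numbers invented for illustration) of how SQL's AVG(), which silently skips NULLs, and application code that coalesces missing values to zero produce two different "truths" from the same rows:

```python
# Toy illustration of two "equivalent" implementations of the same
# average-spend feature disagreeing on null handling.

rows = [120.0, None, 80.0, None, 100.0]  # e.g. daily spend, with missing days

# Training side: SQL's AVG() skips NULLs, so only the
# non-null values enter the average.
non_null = [v for v in rows if v is not None]
sql_style_avg = sum(non_null) / len(non_null)

# Serving side: application code that coalesces missing values to 0.0
# before averaging -- an innocent-looking re-implementation.
app_style_avg = sum(v if v is not None else 0.0 for v in rows) / len(rows)

print(sql_style_avg)  # 100.0
print(app_style_avg)  # 60.0 -- same rows, two different "truths"
```

Neither implementation is wrong in isolation. The model just trained on one of them and serves on the other.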

The model was trained on features computed one way. Production served features computed another way.

Every new model required ~100 features. Each feature had to be implemented twice. R&D cycles took 6-9 months.

The customer quote stuck with me: "Data science and engineering operate separately, leading to inconsistent feature transformations between offline training and online serving. This causes mismatched 'truths,' null-handling issues, and compliance risk."

Why This Is Structurally Hard

This isn't an engineering failure. It's an architectural gap.

Batch systems and online systems have fundamentally different execution models.

Batch warehouses are optimized for large-scale SQL aggregations. AVG(revenue) OVER (PARTITION BY user ORDER BY timestamp ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) works beautifully when you're processing terabytes in a nightly job.

Online serving requires low-latency lookups. That same 30-day rolling average has to be computed in real-time from a stream or pre-materialized in a feature store.
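To make the contrast concrete, here's a minimal Python sketch (not any real system's code) of one rolling-average feature computed two ways: batch-style over the full sorted history, and incrementally per event the way a serving path would:

```python
# One rolling-average feature, two execution models.
from collections import deque

def batch_rolling_avg(values, window):
    """Batch style: mimics AVG(...) OVER (ROWS BETWEEN window-1
    PRECEDING AND CURRENT ROW) over the sorted history."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1):i + 1]
        out.append(sum(frame) / len(frame))
    return out

class StreamingRollingAvg:
    """Serving style: maintain the same window incrementally per event."""
    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def update(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

history = [10.0, 20.0, 30.0, 40.0, 50.0]
stream = StreamingRollingAvg(window=3)
online = [stream.update(v) for v in history]

# The two paths agree only because the window semantics match exactly.
# Note the off-by-one trap: "30 PRECEDING AND CURRENT ROW" is a 31-row
# window; a serving buffer of size 30 would silently disagree forever.
assert online == batch_rolling_avg(history, window=3)
```

Getting these two to agree on every edge case, for every feature, in two codebases maintained by two teams, is exactly the reconciliation work that eats R&D cycles.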

Most organizations don't have a feature store. They have two implementations that drift apart over time.

The Real Cost

Every new model required ~100 features, each implemented twice. R&D cycles stretched to 6-9 months — not because the models were hard to build, but because reconciling feature logic between two codebases took that long.

Every time you ship a model, you're betting that two separate codebases produce identical feature values. They won't.

The Fix: Unified Feature Transformation

The solution isn't "write better code." It's a fundamentally different architecture.

What You Have Now (Dual Implementation)

Data science implements features in SQL. Engineering re-implements them in Python, Java, or Go. The implementations diverge. The model fails.

What You Need (Unified Transformation)

Feature transformations are written once. The feature store handles both batch and serving. Training and inference use identical feature logic.

The Key Difference

Dual implementation: Features are computed separately in batch and serving. Inevitable divergence causes skew.

Unified transformation: Features are defined once. The feature store handles the execution details for both batch training and online serving.
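As a sketch of what "defined once" means in practice (the Feature class below is hypothetical, illustrating the pattern rather than any specific feature store's API):

```python
# Hypothetical sketch of "define once, run in both modes".
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Feature:
    name: str
    transform: Callable[[Optional[float]], float]  # one definition of truth

    def batch_compute(self, column: List[Optional[float]]) -> List[float]:
        """Training path: apply the transform across a warehouse column."""
        return [self.transform(v) for v in column]

    def online_compute(self, value: Optional[float]) -> float:
        """Serving path: apply the identical transform to a single lookup."""
        return self.transform(value)

# Null handling is decided exactly once, here:
spend = Feature("spend_30d", lambda v: v if v is not None else 0.0)

training = spend.batch_compute([120.0, None, 80.0])  # [120.0, 0.0, 80.0]
serving = spend.online_compute(None)                  # 0.0
# Both paths share one null-handling rule, so they cannot drift.
```

The point isn't the class; it's that the transformation logic lives in exactly one place, and the two execution modes are thin wrappers around it.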

How to Get Started

Find the features that show the largest distribution drift between your training dataset and your serving traffic — that's where skew is hiding, and that's where to start unifying.
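One way to rank features by drift is the Population Stability Index. Here's a self-contained sketch (bin count and sample data are illustrative, not from the source):

```python
# Rank features by train/serve distribution drift using the
# Population Stability Index (PSI): higher means more drift.
import math

def psi(expected, actual, bins=10):
    """PSI between two samples of one feature over shared equal-width bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample, i):
        n = sum(1 for v in sample
                if edges[i] <= v < edges[i + 1]
                or (i == bins - 1 and v >= edges[bins - 1]))
        return max(n / len(sample), 1e-6)  # floor to avoid log(0)

    return sum((frac(actual, i) - frac(expected, i))
               * math.log(frac(actual, i) / frac(expected, i))
               for i in range(bins))

train = {"spend_30d": [1.0, 2.0, 3.0, 4.0], "age": [20.0, 30.0, 40.0, 50.0]}
serve = {"spend_30d": [1.1, 2.1, 2.9, 4.2], "age": [60.0, 65.0, 70.0, 75.0]}

# Unify the worst offenders first.
ranked = sorted(train, key=lambda f: psi(train[f], serve[f]), reverse=True)
print(ranked)  # ['age', 'spend_30d'] -- age drifted most
```

Whatever drift metric you pick, the output you want is a ranked list of features, so the unification effort starts where the skew is worst.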

The Hard Truth

Your model isn't the problem. Your features are lying to it.

Until transformation logic is unified — written once, run in both batch training and online serving — every model deployment carries latent skew risk.

The divide isn't just technical. It's organizational.

Data science and engineering need to share a single feature transformation framework. Not separate codebases. Not "we'll sync up periodically." One definition of truth, two execution modes.

The fix isn't a better model. It's a better feature architecture.


P.S. Feature stores and unified transformation layers exist precisely to solve this problem. The technology is available. The bottleneck is organizational alignment between the teams that build models and the teams that serve them.

Tags

ML · Feature Stores · Data Warehousing
