The Model That Worked Too Well
You have two teams, two codebases, one model.
They're supposed to agree. They never do.
Data science builds features in batch SQL. Engineering re-implements them in production code. The model trains on one definition of truth and serves on another.
94% accuracy on the holdout set. You deploy to production. Two weeks later, the predictions are catastrophically wrong.
I've seen this exact failure mode across multiple organizations. Their ML models performed perfectly in evaluation but produced unexpected results in production.
The evaluation worked because the training features were computed correctly. Production broke because the serving features were computed differently.
The problem wasn't their model. It was their feature architecture.
Your Model Is Fine. Your Features Are Lying.
Here's the uncomfortable truth: most ML production failures have nothing to do with model quality. They're feature transformation failures in disguise.
And the root cause isn't just technical — it's organizational.
Data science and data engineering operate as separate fiefdoms. Each has its own roadmap, its own tech stack, its own definition of "truth." Features get implemented twice in two different codebases, and everyone assumes the implementations match. They never do.
The symptoms look like model problems:
- Wrong predictions
- Performance degradation
- Unexpected outputs
- Compliance failures
But the root cause is the dual implementation itself.
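To make the skew concrete, here is a minimal sketch (function names and data are illustrative) of one feature implemented twice: the training side mirrors SQL's AVG(COALESCE(spend, 0)), while a serving-side re-implementation silently skips nulls.

```python
# Hypothetical illustration: the "same" feature, implemented twice.

def training_avg_spend(rows):
    # Batch SQL equivalent: AVG(COALESCE(spend, 0)) - nulls count as 0.
    values = [r if r is not None else 0.0 for r in rows]
    return sum(values) / len(values)

def serving_avg_spend(rows):
    # Production re-implementation: nulls are skipped entirely.
    values = [r for r in rows if r is not None]
    return sum(values) / len(values) if values else 0.0

history = [100.0, None, None, 50.0]
print(training_avg_spend(history))  # 37.5 - what the model trained on
print(serving_avg_spend(history))   # 75.0 - what the model sees live
```

Both functions are "correct" in isolation; the model only breaks because it learned one definition and is served the other.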
The Silent Killer: Dual Feature Implementations
Your features follow a split path: transformations are implemented twice, once in batch SQL for training, and again in production code for serving.
Data science builds features in the data warehouse. Engineering re-implements them for serving.
They're supposed to be the same. They never are.
What Kept Happening: Technical Breakdown
I spoke with a customer who described this exact problem. Their ML models showed strong evaluation metrics but failed in production.
The issue? Data science and engineering operated separately.
Feature transformations were implemented in two different codebases:
- Training: Batch SQL in the data warehouse
- Serving: Production code in a different language
Each implementation made different assumptions:
- Null handling: SQL COALESCE vs application logic
- Windowing: Batch OVER (PARTITION BY ...) vs streaming aggregation
- Join semantics: Batch inner joins vs real-time lookups
- Data freshness: Nightly batch jobs vs continuous updates
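The windowing assumption alone is enough to cause skew. In this sketch (event data is hypothetical), a row-count window in the batch SQL style and a time-based window in the streaming style return different "rolling averages" over the exact same events.

```python
from datetime import datetime, timedelta

# Hypothetical events: (timestamp, value), with a two-day gap.
events = [
    (datetime(2024, 1, 1, 0), 10.0),
    (datetime(2024, 1, 1, 6), 20.0),
    (datetime(2024, 1, 3, 0), 60.0),
]

def batch_row_window_avg(events, n_rows=3):
    # Row-based frame (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW):
    # always the last n_rows events, regardless of their age.
    window = [v for _, v in events[-n_rows:]]
    return sum(window) / len(window)

def streaming_time_window_avg(events, horizon=timedelta(hours=24)):
    # Time-based window: only events within the horizon of the latest one.
    now = events[-1][0]
    window = [v for t, v in events if now - t <= horizon]
    return sum(window) / len(window)

print(batch_row_window_avg(events))       # 30.0
print(streaming_time_window_avg(events))  # 60.0
```

Neither definition is wrong; they are simply different features wearing the same name.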
The model was trained on features computed one way. Production served features computed another way.
Every new model required ~100 features. Each feature had to be implemented twice. R&D cycles took 6-9 months.
The customer quote stuck with me: "Data science and engineering operate separately, leading to inconsistent feature transformations between offline training and online serving. This causes mismatched 'truths,' null-handling issues, and compliance risk."
Why This Is Structurally Hard
This isn't an engineering failure. It's an architectural gap.
Batch systems and online systems have fundamentally different execution models.
Batch warehouses are optimized for large-scale SQL aggregations. AVG(revenue) OVER (PARTITION BY user ORDER BY timestamp ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) works beautifully when you're processing terabytes in a nightly job.
Online serving requires low-latency lookups. That same rolling average has to be computed in real time from a stream, or pre-materialized in a feature store.
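One way to meet that latency budget, sketched here with illustrative names rather than any particular feature store's API, is to maintain the aggregate incrementally so online reads are constant-time while still matching the batch row-window definition.

```python
from collections import deque

class RollingAverage:
    """Incrementally maintained rolling average over the last n values,
    mirroring SQL's ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW. A sketch
    of what a feature store might materialize for low-latency serving."""

    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        # O(1) per event: add the new value, evict the oldest if needed.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.n:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

ra = RollingAverage(n=3)
for v in [10.0, 20.0, 60.0, 30.0]:
    avg = ra.update(v)
print(avg)  # mean of the last 3 values (20, 60, 30)
```

Because the incremental version is derived from the same window definition the batch job uses, training and serving see the same number, just computed on different schedules.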
Most organizations don't have a feature store. They have two implementations that drift apart over time.
The Real Cost
At ~100 features per model, each implemented twice, that is roughly 200 implementations to keep in sync. R&D cycles stretched to 6-9 months, not because the models were hard to build, but because reconciling feature logic between two codebases took that long.
Every time you ship a model, you're betting that two separate codebases produce identical feature values. They won't.
The Fix: Unified Feature Transformation
The solution isn't "write better code." It's a fundamentally different architecture.
What You Have Now (Dual Implementation)
Data science implements features in SQL. Engineering re-implements in Python/Java/Go. The implementations diverge. The model fails.
What You Need (Unified Transformation)
Feature transformations are written once. The feature store handles both batch and serving. Training and inference use identical feature logic.
The Key Difference
Dual implementation: Features are computed separately in batch and serving. Inevitable divergence causes skew.
Unified transformation: Features are defined once. The feature store handles the execution details for both batch training and online serving.
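What "defined once" can look like in miniature (all names here are illustrative, not any specific feature store's API): one pure transform function that both the batch path and the online path call.

```python
def compute_features(raw):
    # The single definition of truth: null handling lives here, once.
    spend = raw.get("spend") or 0.0
    visits = raw.get("visits") or 0
    return {
        "spend_per_visit": spend / visits if visits else 0.0,
        "is_active": visits > 0,
    }

def batch_materialize(rows):
    # Offline path: map the same function over warehouse rows.
    return [compute_features(r) for r in rows]

def serve(raw_event):
    # Online path: identical logic, one event at a time.
    return compute_features(raw_event)

train = batch_materialize([{"spend": 90.0, "visits": 3}, {"visits": None}])
live = serve({"spend": 90.0, "visits": 3})
assert train[0] == live  # training and serving agree by construction
```

The point is not the three tiny functions; it is that there is exactly one place where null handling, ratios, and activity flags are defined, so skew cannot be introduced by a second author.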
How to Get Started
Find the features that show the largest distribution drift between your training dataset and your serving traffic — that's where skew is hiding, and that's where to start unifying.
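One common way to quantify that drift is the Population Stability Index (PSI); the sketch below bins each feature on training data and compares serving traffic against those bins. The 0.2 threshold is a widely used rule of thumb, not a standard, and the data here is simulated.

```python
import math

def psi(train, serve, n_bins=10):
    # Bin edges come from the training distribution.
    lo, hi = min(train), max(train)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def bin_fracs(values):
        counts = [0] * n_bins
        for v in values:
            idx = sum(v > e for e in edges)  # which bin v falls into
            counts[idx] += 1
        # Smooth zero bins so the log term stays finite.
        return [(c + 0.5) / (len(values) + 0.5 * n_bins) for c in counts]

    p, q = bin_fracs(train), bin_fracs(serve)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

train_vals = [float(i % 100) for i in range(1000)]
shifted = [v + 40.0 for v in train_vals]      # simulated serving drift
print(round(psi(train_vals, train_vals), 3))  # 0.0: no drift
print(psi(train_vals, shifted) > 0.2)         # True: large drift
```

Ranking features by PSI between the training set and a sample of live traffic gives you a prioritized list of where dual implementations have already diverged.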
The Hard Truth
Your model isn't the problem. Your features are lying to it.
Until transformation logic is unified — write once, run in both batch and streaming — every model deployment carries latent skew risk.
The divide isn't just technical. It's organizational.
Data science and engineering need to share a single feature transformation framework. Not separate codebases. Not "we'll sync up periodically." One definition of truth, two execution modes.
The fix isn't a better model. It's a better feature architecture.
P.S. Feature stores and unified transformation layers exist precisely to solve this problem. The technology is available. The bottleneck is organizational alignment between the teams that build models and the teams that serve them.