A model that achieves 0.85 Gini in a notebook is not a model in production. The gap between the two is where most ML programmes lose twelve to eighteen months — and where a regulator’s model risk review finds the deficiencies that block sign-off. We have shipped credit-decisioning models into national-scale production environments and watched both the discipline that works and the shortcuts that quietly cost capital downstream.
The problem is rarely the model itself. By the time a team is preparing for production, the algorithm is usually good enough. What breaks the deployment is the surrounding system: feature pipelines that drift between training and serving, validation regimes that catch statistical lift but miss business-impact regressions, and monitoring stacks that flag drift in dashboards no one is accountable for. Most production failures we are called in to diagnose are not “the model is wrong” failures — they are “the model is right but the operating environment around it isn’t” failures.
The disciplines that distinguish programmes that ship
The eight-week sequence below is organised into four phases, but underneath it sit three disciplines that most consistently separate ML programmes that go live from programmes that get stuck in the validation backlog.
Validation across three lenses.
Pre-agreed A/B thresholds.
Monitoring with ownership.
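To make the second discipline concrete, here is a minimal sketch of what "pre-agreed" can mean in practice: the thresholds are frozen in a reviewed object before the test starts, and the decision function reads nothing else. Every name and number here (ABThresholds, ArmResult, decide, the uplift and guardrail values) is a hypothetical placeholder rather than a description of any particular framework, and a real test would also include a significance check that this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABThresholds:
    """Decision thresholds agreed and signed off before the test starts.

    Field names and values are illustrative assumptions only.
    """
    min_samples_per_arm: int          # no decision on less data than this
    min_approval_uplift: float        # challenger must beat champion by at least this
    max_default_rate_increase: float  # guardrail: tolerated rise in default rate


@dataclass(frozen=True)
class ArmResult:
    n: int
    approval_rate: float
    default_rate: float


def decide(champion: ArmResult, challenger: ArmResult, t: ABThresholds) -> str:
    """Return 'continue', 'rollback', 'promote' or 'keep_champion'.

    The outcome depends only on the observed results and the frozen
    thresholds, so nobody can renegotiate the bar after seeing the data.
    """
    if min(champion.n, challenger.n) < t.min_samples_per_arm:
        return "continue"
    # The guardrail is checked before the success metric on purpose.
    if challenger.default_rate - champion.default_rate > t.max_default_rate_increase:
        return "rollback"
    if challenger.approval_rate - champion.approval_rate >= t.min_approval_uplift:
        return "promote"
    return "keep_champion"


if __name__ == "__main__":
    thresholds = ABThresholds(
        min_samples_per_arm=1000,
        min_approval_uplift=0.02,
        max_default_rate_increase=0.005,
    )
    champion = ArmResult(n=5000, approval_rate=0.40, default_rate=0.030)
    challenger = ArmResult(n=5000, approval_rate=0.43, default_rate=0.031)
    print(decide(champion, challenger, thresholds))
```

Checking the guardrail first is a deliberate design choice in this sketch: a challenger that approves more applicants but breaches the agreed risk limit is rolled back rather than promoted.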
Underneath those three, the feature-foundation work in the first fortnight is what makes any of them possible. The hardest invariant to enforce later is consistency between the training-time feature definition and the serving-time feature definition; we put a feature store in place early so that invariant lives at the data layer, rather than surfacing six months in when the score starts drifting and no one can explain why.
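One way to picture that invariant is a single registry of feature definitions that both the training path and the serving path call, plus a parity check that recomputes features offline and compares them with what serving logged. The sketch below is an illustration under assumptions, not a feature store implementation: the registry, the debt_to_income feature, and the raw field names are all hypothetical.

```python
from typing import Callable, Dict, List, Mapping

Raw = Mapping[str, float]

# Hypothetical registry: the one place feature logic is allowed to live.
FEATURES: Dict[str, Callable[[Raw], float]] = {}


def feature(name: str):
    """Register a feature definition under a stable name."""
    def register(fn: Callable[[Raw], float]) -> Callable[[Raw], float]:
        FEATURES[name] = fn
        return fn
    return register


@feature("debt_to_income")
def debt_to_income(raw: Raw) -> float:
    # Illustrative feature; the field names are assumptions.
    income = raw["monthly_income"]
    return raw["monthly_debt"] / income if income > 0 else 0.0


def build_features(raw: Raw) -> Dict[str, float]:
    """Compute every registered feature for one raw record."""
    return {name: fn(raw) for name, fn in FEATURES.items()}


def training_rows(raw_rows: List[Raw]) -> List[Dict[str, float]]:
    """Training path: same definitions, applied in batch."""
    return [build_features(r) for r in raw_rows]


def serve(raw: Raw) -> Dict[str, float]:
    """Serving path: same definitions, applied per request."""
    return build_features(raw)


def parity_mismatches(logged: Mapping[str, float], raw: Raw,
                      tol: float = 1e-9) -> List[str]:
    """Names of features whose logged serving value differs from an
    offline recomputation on the same raw record."""
    recomputed = build_features(raw)
    return sorted(
        name for name, value in recomputed.items()
        if name not in logged or abs(logged[name] - value) > tol
    )
```

Running parity_mismatches on a sample of logged requests is a cheap way to detect skew early; an empty list means the serving values match the training-time definitions within the tolerance.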
Four phases, two weeks each
Weeks 1-2: Data & feature foundation
Deliverable: Feature pipeline + scalable feature store.
Weeks 3-4: Rigorous model development & validation
Deliverable: Champion + challenger with full audit documentation.
Weeks 5-6: Compliant deployment & A/B testing
Deliverable: Production API + A/B framework + rollback plan.
Weeks 7-8: Continuous monitoring & governance
Deliverable: Production-grade ML system with alerting and rollback.
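As an illustration of the alerting in the final phase, the sketch below computes a population stability index (PSI) for one feature and, when it crosses a threshold, returns an alert that carries a named owner and a next action instead of only a number on a dashboard. The 0.25 threshold is a common rule of thumb rather than anything specified here, and the DriftAlert structure, owner string, and action text are hypothetical.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin; edges must be sorted ascending."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    # Clamp empty bins to a small floor so the logarithm stays defined.
    return [max(c / len(values), 1e-6) for c in counts]


def psi(baseline: Sequence[float], live: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index: sum of (a - e) * ln(a / e) over bins."""
    e = bin_shares(baseline, edges)
    a = bin_shares(live, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


@dataclass(frozen=True)
class DriftAlert:
    feature: str
    psi: float
    owner: str   # the named person or team accountable for acting
    action: str  # what the owner is expected to do next


def check_drift(feature: str, baseline: Sequence[float], live: Sequence[float],
                edges: Sequence[float], owner: str,
                threshold: float = 0.25) -> Optional[DriftAlert]:
    """Return an owned alert if drift exceeds the threshold, else None."""
    value = psi(baseline, live, edges)
    if value < threshold:
        return None
    return DriftAlert(feature, value, owner,
                      "investigate; roll back if the drift is confirmed")
```

Making owner a required argument is the point of the sketch: an alert cannot be constructed without someone being accountable for it.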
What the framework does not assume is that the model you are training is the right model. Problem framing — what decision is the model actually informing, what counterfactual would the business take without it, what is the cost of being wrong — sits upstream of week one and is the single most under-invested step in production ML. The Production AI & ML pillar we run with clients includes that framing work; the eight-week framework picks up after it.
