    January 23, 2026 · 35 min read · MLOps Interview

    The MLOps Questions That Exposed My Production Blind Spots

    Three years of building ML models taught me algorithms. One production failure taught me MLOps. Here are the 30 questions that separate data scientists who can code from engineers who can scale ML systems reliably.


    My first production ML model seemed perfect in notebooks. 95% accuracy, elegant architecture, clean code. Then I deployed it. Within hours, predictions started drifting. Data quality issues cascaded through the pipeline. The model that worked flawlessly on static datasets crumbled under real-world conditions.

    That failure taught me the most valuable lesson of my career: MLOps isn't just DevOps for ML. It's a distinct discipline that bridges the gap between experimental data science and production-grade systems. The best MLOps engineers don't just deploy models; they build sustainable ML systems that stay reliable as they scale.

    After interviewing at companies like Netflix, Uber, and Spotify—all leaders in production ML—I've compiled the questions that truly matter for MLOps roles in 2026. These aren't just technical challenges; they're real problems you'll face when your ML system serves millions of users.

    MLOps Interview Success Framework

    • Entry Level (0-2 years): ML pipelines, basic deployment, and monitoring fundamentals
    • Mid Level (2-5 years): Feature stores, model versioning, and automated training
    • Expert Level (5+ years): System architecture, scaling strategies, and organizational MLOps
    • Remember: Focus on reliability, scalability, and maintainability over complexity

    ML Pipeline Automation (Questions 1-10)

    Entry Level (0-2 Years)

    1. What is an ML pipeline and what are its key components?

      End-to-end workflow for ML: data ingestion, preprocessing, feature engineering, model training, validation, and deployment. Each stage should be versioned and monitored.
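      The stage sequence above can be sketched as a minimal pipeline object. This is an illustrative toy, not any particular framework's API; real orchestrators such as Airflow or Kubeflow add scheduling, retries, versioning, and lineage on top of this basic idea.

```python
from typing import Any, Callable

class Pipeline:
    """Toy ML pipeline: an ordered list of named stages."""

    def __init__(self) -> None:
        self.stages: list[tuple[str, Callable[[Any], Any]]] = []

    def stage(self, name: str):
        """Decorator that registers a function as a named stage."""
        def register(fn: Callable[[Any], Any]):
            self.stages.append((name, fn))
            return fn
        return register

    def run(self, data: Any) -> Any:
        for name, fn in self.stages:
            data = fn(data)  # each stage's output feeds the next stage
        return data

pipeline = Pipeline()

@pipeline.stage("ingest")
def ingest(raw):
    # Parse raw records into numeric values.
    return [float(x) for x in raw]

@pipeline.stage("preprocess")
def preprocess(values):
    # Center the data around zero (stand-in for real feature engineering).
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

      In an interview, the point to stress is that each stage boundary is where you attach versioning, validation, and monitoring.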

    2. How do you handle data validation in ML pipelines?

      Schema validation, statistical checks, data quality metrics, anomaly detection. Use tools like Great Expectations or TensorFlow Data Validation for systematic checks.
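      A lightweight sketch of the first two checks, schema validation plus a range check, in the spirit of what Great Expectations or TFDV do systematically. The column names, types, and bounds are illustrative assumptions, not from any real dataset.

```python
def validate_batch(rows, schema, bounds):
    """Return a list of human-readable validation errors (empty = clean)."""
    errors = []
    for i, row in enumerate(rows):
        # Schema validation: required columns with expected types.
        for col, expected_type in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], expected_type):
                errors.append(f"row {i}: '{col}' is not {expected_type.__name__}")
        # Statistical sanity check: numeric values inside plausible bounds.
        for col, (lo, hi) in bounds.items():
            if col in row and isinstance(row[col], (int, float)):
                if not lo <= row[col] <= hi:
                    errors.append(f"row {i}: '{col}'={row[col]} outside [{lo}, {hi}]")
    return errors
```

      A pipeline would run this as a gate: an empty error list lets the batch proceed, anything else quarantines it for inspection.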

    3. What's the difference between batch and streaming ML pipelines?

      Batch processes data in chunks periodically, streaming processes data in real-time. Choice depends on latency requirements, data volume, and business needs.

    4. How do you implement automated model retraining?

      Trigger retraining based on time intervals, performance degradation, or data drift detection. Include validation gates and gradual rollout mechanisms.
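      The three triggers can be combined into one decision function. This is a hedged sketch: the thresholds (30-day schedule, 0.02 AUC drop, 0.2 drift score) are illustrative defaults, and a real system would add validation gates before any retrained model ships.

```python
from datetime import datetime, timedelta
from typing import Optional

def should_retrain(last_trained: datetime,
                   now: datetime,
                   current_auc: float,
                   baseline_auc: float,
                   drift_score: float,
                   max_age: timedelta = timedelta(days=30),
                   max_auc_drop: float = 0.02,
                   drift_threshold: float = 0.2) -> Optional[str]:
    """Return the trigger reason, or None if no retraining is needed."""
    if now - last_trained >= max_age:
        return "schedule"      # time-based trigger
    if baseline_auc - current_auc > max_auc_drop:
        return "performance"   # metric degradation trigger
    if drift_score > drift_threshold:
        return "drift"         # input distribution shift trigger
    return None
```

      Returning the reason (rather than a bare boolean) makes it easy to log why each retraining run happened, which interviewers often probe on.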

    Mid Level (2-5 Years)

    5. Design an ML pipeline orchestration system.

      Consider workflow management (Airflow, Kubeflow), dependency management, error handling, retry logic, and resource scheduling. Include monitoring and alerting.

    6. How do you handle pipeline failures and rollbacks?

      Implement circuit breakers, graceful degradation, automated rollback triggers, and manual override capabilities. Maintain pipeline state and enable recovery.
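      A minimal sketch of the retry half of this answer: run a flaky step with exponential backoff, and surface the failure only after exhausting attempts. A real orchestrator would also persist pipeline state and expose a manual override; the attempt count and delays here are illustrative.

```python
import time

def run_with_retries(step, max_attempts=3, base_delay=0.01):
    """Run `step()`, retrying with exponential backoff; re-raise on exhaustion."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the failure to trigger rollback/alerting
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.01s, 0.02s, ...
```

      The key design point is that the final re-raise is what hands control to the rollback and alerting machinery, rather than swallowing the error.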

    7. What strategies do you use for pipeline testing?

      Unit tests for components, integration tests for end-to-end flows, data validation tests, model quality tests, and shadow mode testing in production.

    8. How do you optimize pipeline performance and costs?

      Resource allocation optimization, caching strategies, parallel processing, spot instances, and data partitioning. Monitor compute costs and optimize based on SLAs.

    Expert Level (5+ Years)

    9. Design a multi-tenant ML pipeline platform.

      Resource isolation, multi-tenancy patterns, shared infrastructure, security boundaries, cost attribution, and self-service capabilities for data science teams.

    10. How do you implement MLOps governance and compliance?

      Model lineage tracking, audit trails, approval workflows, risk assessments, bias monitoring, and regulatory compliance (GDPR, SOX, etc.).

    Model Deployment & Serving (Questions 11-18)

    11. Compare different model deployment patterns.

      Blue-green deployments for zero downtime, canary releases for gradual rollout, A/B testing for business impact, shadow mode for validation without user impact.
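      Canary routing is easy to sketch: hash each user id so a fixed fraction lands on the candidate model, and every user sticks to one variant across requests. The 5% split and variant names are illustrative assumptions.

```python
import hashlib

def route(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically map a user to 'canary' or 'stable' by hashing their id."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "stable"
```

      Hashing rather than random assignment matters: sticky routing keeps a user's experience consistent and makes canary metrics attributable to a stable cohort.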

    12. How do you serve models at scale with low latency?

      Model optimization (quantization, pruning), caching strategies, load balancing, auto-scaling, edge deployment, and efficient serialization formats.

    13. What are the trade-offs between batch and real-time inference?

      Batch: higher throughput, cost-effective, higher latency. Real-time: lower latency, higher cost, complex infrastructure. Choose based on business requirements.

    14. How do you handle model versioning and backwards compatibility?

      Semantic versioning, API versioning, feature flag management, graceful degradation, and migration strategies for breaking changes.

    15. Design a model serving architecture for millions of requests per second.

      Microservices architecture, container orchestration, CDN integration, database sharding, caching layers, and performance optimization strategies.

    16. How do you implement feature stores for real-time serving?

      Feature store architecture, online/offline consistency, feature freshness, caching strategies, and integration with serving infrastructure.
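      The freshness/caching piece can be sketched as a TTL cache in front of the slower store: serve the cached features while fresh, otherwise fall back to a loader (e.g. a read from the offline store). The TTL and feature names are illustrative.

```python
import time

class FeatureCache:
    """Online feature cache with a freshness TTL."""

    def __init__(self, loader, ttl_seconds: float = 60.0):
        self.loader = loader          # fallback: fetch features for an entity id
        self.ttl = ttl_seconds
        self._store: dict = {}        # entity_id -> (fetched_at, features)

    def get(self, entity_id: str) -> dict:
        entry = self._store.get(entity_id)
        now = time.monotonic()
        if entry and now - entry[0] < self.ttl:
            return entry[1]                # still fresh: serve from cache
        features = self.loader(entity_id)  # stale or missing: reload
        self._store[entity_id] = (now, features)
        return features
```

      The TTL is the knob that trades feature freshness against load on the backing store, which is exactly the tension this question is probing.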

    17. What security considerations exist for ML model serving?

      Model theft protection, adversarial attack prevention, input validation, rate limiting, encryption in transit/at rest, and access control.

    18. How do you implement multi-armed bandit deployment strategies?

      Thompson sampling, epsilon-greedy strategies, contextual bandits, exploration vs exploitation balance, and business metric optimization.
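      A minimal Thompson-sampling sketch for choosing between two model variants by conversion rate: keep a Beta posterior per arm, sample each posterior, and serve the arm with the best draw. The arm names and priors are illustrative.

```python
import random

class ThompsonBandit:
    def __init__(self, arms):
        # Beta(1, 1) prior = uniform belief over each arm's conversion rate.
        self.stats = {arm: {"success": 1, "failure": 1} for arm in arms}

    def choose(self) -> str:
        """Sample each arm's posterior and pick the best draw."""
        draws = {
            arm: random.betavariate(s["success"], s["failure"])
            for arm, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm: str, reward: bool) -> None:
        """Fold an observed outcome back into the chosen arm's posterior."""
        key = "success" if reward else "failure"
        self.stats[arm][key] += 1
```

      Sampling from the posterior (rather than always exploiting the current best estimate) is what gives Thompson sampling its built-in exploration/exploitation balance.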

    Feature Stores & Data Management (Questions 19-24)

    19. What is a feature store and why is it important?

      Centralized repository for ML features ensuring consistency between training and serving, feature reusability, and data governance across teams.

    20. How do you ensure training-serving consistency?

      Shared feature computation logic, feature stores, containerized environments, data validation, and automated consistency testing.
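      The "shared feature computation logic" point is the crux: one feature function, imported by both the batch training job and the online serving path, so the two can never drift apart. The feature names here are made up for illustration.

```python
import math

def compute_features(event: dict) -> dict:
    """Shared feature logic imported by both training and serving code."""
    amount = float(event["amount"])  # raw events may carry amounts as strings
    return {
        "log_amount": math.log1p(amount),
        "is_weekend": event["day_of_week"] in ("sat", "sun"),
    }
```

      The anti-pattern this prevents is reimplementing the same transform twice (say, in a SQL training query and a serving microservice) and letting the copies silently diverge.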

    21. Design a feature store architecture for streaming and batch features.

      Lambda architecture with batch and streaming layers, feature materialization, time-travel capabilities, and unified API for feature access.

    22. How do you handle feature versioning and evolution?

      Schema evolution, backwards compatibility, feature deprecation strategies, and impact analysis for downstream models.

    23. What strategies exist for feature computation optimization?

      Precomputation, caching, incremental updates, resource allocation, and computation graph optimization for complex feature pipelines.

    24. How do you implement feature monitoring and quality assurance?

      Statistical monitoring, drift detection, data quality checks, feature importance tracking, and automated alerting for feature anomalies.

    Model Monitoring & Drift Detection (Questions 25-28)

    25. What types of drift should you monitor in production ML systems?

      Data drift (input distribution changes), concept drift (relationship changes), prediction drift (output changes), and performance drift (business metric changes).

    26. How do you implement automated drift detection?

      Statistical tests (KS test, PSI), machine learning-based detection, threshold-based alerts, and integration with retraining pipelines.
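      PSI is simple enough to sketch directly: over shared bins, PSI = Σ (actual% − expected%) · ln(actual% / expected%). The commonly quoted rule of thumb (PSI > 0.2 suggests significant shift) is a convention, not a standard, so treat the threshold as an assumption to tune.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # clamp to avoid log(0) on empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```

      In a monitoring pipeline you would compute this per feature against the training-set histogram and route scores above the threshold into the retraining trigger.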

    27. Design a comprehensive ML model monitoring system.

      Multi-layered monitoring: infrastructure, data quality, model performance, business metrics, with dashboards, alerting, and automated responses.

    28. How do you balance false alarms vs missed issues in monitoring?

      Adaptive thresholds, multiple confirmation signals, severity levels, context-aware alerting, and feedback loops for threshold tuning.
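      A sketch of the adaptive-threshold idea: alert only when a metric deviates from its own rolling baseline by k standard deviations, instead of using a fixed cutoff. The window size, k, and warm-up length are illustrative defaults.

```python
import statistics
from collections import deque

class AdaptiveAlert:
    def __init__(self, window: int = 50, k: float = 3.0):
        self.history = deque(maxlen=window)  # rolling baseline of recent values
        self.k = k

    def observe(self, value: float) -> bool:
        """Record a metric value; return True if it should alert."""
        alert = False
        if len(self.history) >= 10:  # need a warm-up baseline before alerting
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-9  # guard zero variance
            alert = abs(value - mean) > self.k * std
        self.history.append(value)
        return alert
```

      Because the baseline moves with the data, slow seasonal shifts stop generating false alarms while sudden spikes still fire, which is exactly the trade-off the question asks about.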

    CI/CD for Machine Learning (Questions 29-30)

    29. How does CI/CD for ML differ from traditional software CI/CD?

      Includes data validation, model quality gates, A/B testing phases, gradual rollouts, and business metric validation beyond traditional code testing.

    30. Design a complete MLOps CI/CD pipeline from training to production.

      Code commits trigger data validation, model training, performance testing, staged deployments, A/B testing, monitoring setup, and automated rollback mechanisms.


    How to Approach MLOps Interview Questions

    The SCALABLE Framework

    Use this systematic approach for MLOps system design questions:

    1. S - Scale Requirements: "What's the expected traffic and data volume?"
    2. C - Constraints: "What are the latency, accuracy, and budget constraints?"
    3. A - Architecture: "Here's the high-level system architecture..."
    4. L - Lifecycle: "How do we handle the complete ML lifecycle?"
    5. A - Automation: "What processes can we automate?"
    6. B - Business Impact: "How do we measure and optimize business value?"
    7. L - Limitations: "What are the potential failure modes?"
    8. E - Evolution: "How does the system adapt and improve over time?"

    Common MLOps Interview Mistakes

    ❌ Avoid These:

    • Over-engineering the initial solution
    • Ignoring monitoring and observability
    • Forgetting about data quality issues
    • Not considering operational complexity
    • Skipping gradual rollout strategies

    ✓ Do This Instead:

    • Start with simple, reliable solutions
    • Design monitoring from day one
    • Plan for data drift and quality issues
    • Consider team skills and maintenance
    • Always include rollback mechanisms

    The best MLOps engineers I've worked with understand that production ML is 10% algorithms and 90% engineering. They build systems that work reliably at scale, fail gracefully, and evolve with changing requirements. Focus on building this systems thinking mindset, and you'll stand out in any MLOps interview.