    January 23, 2026 · 45 min read · Site Reliability Engineering

    The SRE Questions That Revealed My True Understanding of Reliability

    Five years of keeping systems running at 99.99% uptime taught me that SRE isn't about firefighting—it's about building sustainable, reliable systems. Here are the questions that separate reactive operators from true reliability engineers.

    [Image: Site reliability engineer monitoring distributed systems, analyzing metrics, and managing incident response]

    My first on-call shift was a wake-up call. When our payment system went down at 2 AM, I panicked, restarted everything I could think of, and barely restored service in 45 minutes. My tech lead asked one simple question: "What's our error budget, and how much did this incident consume?"

    I had no idea. That moment taught me the difference between keeping systems "up" and engineering reliability. SRE isn't about heroic saves—it's about building systems so reliable that heroics become unnecessary. It's about balancing reliability with feature velocity, measuring what matters, and learning from every incident.

    After building reliability practices at high-scale companies and interviewing hundreds of SRE candidates, I've identified the questions that reveal true reliability engineering thinking. These aren't just technical queries—they're frameworks for thinking about systems, risk, and the delicate balance between reliability and innovation.

    SRE Core Principles

    • Error Budgets: How much unreliability can you afford?
    • Service Level Objectives: What promises do you make to users?
    • Monitoring & Observability: How do you know your system is healthy?
    • Incident Response: How do you handle the unexpected?
    • Pro tip: Always discuss trade-offs between reliability and feature velocity

    SLOs, SLIs, and SLAs (Questions 1-8)

    1. Explain the difference between SLI, SLO, and SLA.

    Tests fundamental SRE concepts and business impact understanding

    Answer:

    SLI (Service Level Indicator): A quantitative measure of service behavior (e.g., the proportion of requests completed within 100ms)

    SLO (Service Level Objective): Target value or range for an SLI (e.g., 99.9% availability over 30 days)

    SLA (Service Level Agreement): Business agreement with consequences if SLOs aren't met (e.g., service credits)

    # Example
    SLI: 99.95% of API requests return success (measured)
    SLO: ≥ 99.9% success rate over 30 days
    SLA: 10% service credit if success rate < 99.9%

    2. How do you choose meaningful SLIs for a web application?

    Tests ability to identify user-facing metrics that matter

    Answer:

    Focus on user experience:

    • Availability: Percentage of successful requests
    • Latency: Response time percentiles (50th, 95th, 99th)
    • Throughput: Requests handled per second
    • Quality: Error rate, data freshness

    Example for e-commerce: 99.9% of checkout requests complete within 2 seconds with successful payment processing

    3. What is an error budget and how do you use it?

    Answer:

    Error Budget: Amount of unreliability you can tolerate while meeting SLOs

    Calculation: If SLO is 99.9% availability, error budget is 0.1% (43.2 minutes/month)

    Usage:

    • Balance reliability vs feature velocity
    • Justify infrastructure investments
    • Make deployment decisions
    • Prioritize reliability work
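The budget arithmetic above can be sketched in a few lines of Python (a toy helper for illustration, not any vendor's API):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime...
print(round(error_budget_minutes(0.999), 1))   # 43.2
# ...so a single 45-minute outage (like the one in the intro) overspends the whole budget.
print(budget_remaining(0.999, 45) < 0)         # True
```

This is why the "what's our error budget?" question stings: one bad incident can consume more than a month of budget.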

    4. How do you set realistic SLOs?

    Answer:

    1. Measure current performance baseline over 4+ weeks
    2. Understand user expectations and business requirements
    3. Start slightly below current performance (e.g., if achieving 99.95%, set SLO at 99.9%)
    4. Consider dependencies and external services
    5. Iterate and adjust based on error budget consumption

    5-8. Additional SLO Questions:

    • 5. How do you handle SLO violations and error budget depletion?
    • 6. Explain the concept of SLO burn rate and alerting
    • 7. How do you measure SLOs for batch processing systems?
    • 8. What's the relationship between SLOs and business metrics?

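The burn-rate concept from question 6 reduces to a simple ratio; here is a hedged Python sketch (the 14.4x threshold follows the common multi-window alerting pattern, not a universal standard):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the error budget is being spent relative to the SLO.
    A burn rate of 1.0 spends the budget exactly over the full window."""
    return observed_error_rate / (1 - slo)

# With a 99.9% SLO, a sustained 1% error rate is a 10x burn:
# the 30-day budget would be exhausted in about 3 days.
print(round(burn_rate(0.01, 0.999), 1))  # 10.0
# A widely used paging threshold: page when a 1-hour window burns
# at >= 14.4x, i.e. 2% of the 30-day budget in a single hour.
print(burn_rate(0.01, 0.999) >= 14.4)    # False
```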

    Incident Management & Response (Questions 9-15)

    9. Walk me through your incident response process.

    Tests systematic approach to handling production issues

    Answer:

    1. Detection: Automated alerting identifies issue
    2. Response: On-call engineer acknowledges and assesses
    3. Mitigation: Immediate steps to reduce impact
    4. Communication: Update status page, notify stakeholders
    5. Escalation: Engage additional team members if needed
    6. Resolution: Fix root cause or implement workaround
    7. Recovery: Verify system health and restore normal operations
    8. Post-incident: Conduct blameless postmortem

    10. How do you conduct effective postmortems?

    Answer:

    Blameless Culture: Focus on systems and processes, not individuals

    Structure:

    • Timeline of events with precise timestamps
    • Root cause analysis (5 whys technique)
    • Impact assessment (customers affected, revenue lost)
    • What went well and what went poorly
    • Action items with owners and deadlines

    Follow-up: Track action items and share learnings broadly

    11. What are incident severity levels and how do you classify them?

    Answer:

    SEV-1 (Critical): Complete service outage, data loss, security breach

    SEV-2 (High): Major feature unavailable, significant performance degradation

    SEV-3 (Medium): Minor feature issues, workarounds available

    SEV-4 (Low): Cosmetic issues, documentation problems

    Classification factors: User impact, business impact, affected systems, workaround availability

    12. How do you handle cascading failures?

    Answer:

    Prevention:

    • Circuit breakers and bulkheads
    • Timeouts and retries with backoff
    • Load shedding and rate limiting
    • Graceful degradation

    Response:

    • Stop the cascade at its source
    • Isolate failing components
    • Restore core functionality first
    • Gradually re-enable dependencies
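The circuit breaker mentioned above is worth being able to sketch on a whiteboard. This is a minimal single-threaded toy (production systems typically rely on a library or service-mesh policy rather than hand-rolled code):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: opens after consecutive failures,
    then permits a single trial call after a cooldown (half-open)."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast while the circuit is open is what stops a struggling dependency from dragging down every caller queued behind it.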

    13-15. Additional Incident Management Questions:

    • 13. How do you manage communication during major incidents?
    • 14. Explain the concept of "error budget burn" during incidents
    • 15. How do you balance quick fixes vs proper solutions during outages?

    Monitoring & Observability (Questions 16-23)

    16. What's the difference between monitoring and observability?

    Tests understanding of modern system visibility approaches

    Answer:

    Monitoring: Watching known failure modes with predefined metrics and dashboards

    Observability: Ability to understand system behavior from external outputs, including unknown failure modes

    Three pillars of observability:

    • Metrics: Aggregated measurements over time
    • Logs: Detailed event records
    • Traces: Request flow through distributed systems

    17. How do you design effective alerting?

    Answer:

    Alert on symptoms, not causes: User-facing impact over internal metrics

    Actionability: Every alert should have a clear response playbook

    Precision: Minimize false positives through proper thresholds

    # Good alert: symptom-based, tied to user impact
    API error rate > 5% for 5 minutes
    # Bad alert: cause-based, not directly actionable
    CPU usage > 80%

    Severity levels: Page for immediate action, email for investigation, dashboard for awareness

    18. Explain the USE and RED methods for monitoring.

    Answer:

    USE Method (for resources):

    • Utilization: How busy is the resource?
    • Saturation: How much queuing/waiting?
    • Errors: Error events

    RED Method (for services):

    • Rate: Requests per second
    • Errors: Failed requests per second
    • Duration: Response time percentiles
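The RED method can be computed directly from request samples. This hypothetical helper shows the idea (simple index-based percentiles, illustrative field names):

```python
def red_metrics(requests, window_seconds):
    """Compute RED metrics from (latency_ms, ok) request samples."""
    n = len(requests)
    failed = sum(1 for _, ok in requests if not ok)
    latencies = sorted(lat for lat, _ in requests)

    def pct(p):
        # simple percentile by index (sketch, not interpolated)
        return latencies[min(n - 1, int(p / 100 * n))]

    return {
        "rate_rps": n / window_seconds,    # Rate
        "error_rps": failed / window_seconds,  # Errors
        "p50_ms": pct(50),                 # Duration (median)
        "p99_ms": pct(99),                 # Duration (tail)
    }

# 100 requests over a 10-second window: mostly fast, two failures at the tail
samples = [(20, True)] * 97 + [(80, True), (200, False), (450, False)]
m = red_metrics(samples, window_seconds=10)
print(m)  # rate 10 rps, 0.2 error rps, p50 = 20ms, p99 = 450ms
```

Note how the p99 (450ms) tells a very different story than the median (20ms), which is exactly why question 2 recommends tracking multiple percentiles.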

    19. How do you monitor distributed systems effectively?

    Answer:

    • Distributed tracing: Follow requests across service boundaries
    • Correlation IDs: Link related events across services
    • Service mesh observability: Network-level visibility
    • Synthetic monitoring: Proactive health checks
    • Dependency mapping: Understand service relationships
    • Aggregate dashboards: End-to-end user journey monitoring

    20-23. Additional Monitoring Questions:

    • 20. How do you handle monitoring at scale (high cardinality metrics)?
    • 21. Explain black-box vs white-box monitoring
    • 22. How do you implement effective log aggregation and analysis?
    • 23. What metrics would you track for a microservices architecture?

    Capacity Planning & Performance (Questions 24-30)

    24. How do you approach capacity planning?

    Tests ability to predict and prepare for future resource needs

    Answer:

    Data-driven approach:

    1. Collect historical usage patterns and growth trends
    2. Identify seasonal patterns and business cycles
    3. Model resource utilization vs business metrics
    4. Account for expected product changes and launches
    5. Build headroom for unexpected traffic spikes (20-50%)
    6. Plan for disaster recovery and failure scenarios

    Continuous process: Review quarterly, adjust based on actual vs predicted usage
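Steps 1-5 above boil down to a compound-growth projection plus headroom. A minimal sketch (every parameter here is an illustrative assumption, not a benchmark):

```python
import math

def capacity_forecast(current_peak_rps, monthly_growth, months_ahead,
                      headroom=0.3, per_instance_rps=500):
    """Project peak load forward and size the fleet with headroom.
    monthly_growth is a fraction, e.g. 0.05 for 5%/month, compounded."""
    projected = current_peak_rps * (1 + monthly_growth) ** months_ahead
    required = projected * (1 + headroom)          # spike buffer (20-50%)
    instances = math.ceil(required / per_instance_rps)
    return projected, instances

# 8,000 rps peak today, 5%/month growth, planning 6 months out with 30% headroom
projected, instances = capacity_forecast(8000, 0.05, 6)
print(round(projected), instances)  # ~10721 projected rps -> 28 instances
```

The quarterly review then compares `projected` against actual peaks and tightens the growth assumption.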

    25. Explain performance testing strategies.

    Answer:

    Load Testing: Normal expected load to verify performance under typical conditions

    Stress Testing: Beyond normal capacity to find breaking points

    Spike Testing: Sudden traffic increases to test auto-scaling

    Soak Testing: Extended periods at normal load to find memory leaks

    Chaos Engineering: Deliberately introduce failures to test resilience

    Key metrics: Response time, throughput, error rate, resource utilization

    26. How do you handle traffic spikes and load balancing?

    Answer:

    Auto-scaling strategies:

    • Horizontal scaling: Add/remove instances
    • Vertical scaling: Increase instance resources
    • Predictive scaling: Scale based on patterns
    • Schedule-based scaling: Known traffic patterns

    Load balancing algorithms: Round-robin, least connections, weighted routing, geographic routing

    Traffic shaping: Rate limiting, circuit breakers, graceful degradation
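The token bucket is one common way to implement the rate limiting mentioned above: it allows short bursts while capping the sustained rate. A minimal single-threaded sketch (no locking; parameters are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter sketch: permits bursts up to `capacity`,
    refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request

bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow() for _ in range(15)]
print(sum(burst))  # the 10-token burst is allowed; the rest are shed
```

Shedding with a fast rejection (HTTP 429) is usually kinder to users than letting every request queue up and time out.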

    27-30. Additional Capacity Questions:

    • 27. How do you optimize database performance at scale?
    • 28. Explain caching strategies and cache invalidation patterns
    • 29. How do you handle resource allocation in multi-tenant systems?
    • 30. What's your approach to cost optimization while maintaining reliability?

    Automation & Infrastructure (Questions 31-35)

    31. How do you implement reliable automation?

    Tests understanding of building trustworthy automated systems

    Answer:

    Design principles:

    • Idempotency: Same result regardless of execution count
    • Error handling: Graceful failure and rollback capability
    • Observability: Comprehensive logging and monitoring
    • Testing: Unit tests, integration tests, canary deployments
    • Gradual rollout: Start small, expand based on success

    Human oversight: Manual approval gates for critical operations
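Idempotency, the first principle above, is easiest to show in code. This hypothetical provisioning helper converges on a desired state, so re-running it after a partial failure is always safe:

```python
def ensure_user(db: dict, username: str, role: str) -> bool:
    """Idempotent provisioning step: converge to the desired state.
    Returns True only if a change was actually made."""
    if db.get(username) == role:
        return False  # already in desired state; re-run is a no-op
    db[username] = role
    return True

db = {}
print(ensure_user(db, "deploy-bot", "ci"))  # True: first run makes the change
print(ensure_user(db, "deploy-bot", "ci"))  # False: second run changes nothing
```

The "changed or not" return value is also what makes gradual rollouts auditable: automation can report exactly what it touched.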

    32. Explain your approach to infrastructure as code.

    Answer:

    Benefits: Reproducibility, version control, code review, automated testing

    Best practices:

    • Modular design with reusable components
    • Environment parity (dev/staging/prod)
    • State management and locking
    • Secrets management integration
    • Compliance and security scanning

    Tools: Terraform, Pulumi, CloudFormation, Ansible

    33. How do you ensure security in SRE practices?

    Answer:

    Security by design:

    • Least privilege access and role-based permissions
    • Network segmentation and zero-trust architecture
    • Encryption at rest and in transit
    • Regular security audits and vulnerability scanning
    • Secrets rotation and secure credential management
    • Compliance monitoring and reporting

    Incident response: Security incident playbooks and breach notification procedures

    34-35. Additional Automation Questions:

    • 34. How do you implement effective backup and disaster recovery automation?
    • 35. Explain your approach to automated deployment rollbacks and safety mechanisms

    Master SRE Interviews with Confidence

    Struggling with SLO calculations or incident response scenarios? LastRound AI provides real-time SRE guidance during your interviews.

    • ✓ SLO/SLI calculation examples and best practices
    • ✓ Incident response playbooks and communication templates
    • ✓ Monitoring setup and alerting strategy guidance
    • ✓ Capacity planning formulas and automation scripts

    SRE Interview Success Framework

    The RELIC Framework for SRE Thinking

    Use this framework to approach any SRE problem systematically:

    1. R - Reliability: What are the reliability requirements and current state?
    2. E - Error Budget: How much unreliability can we afford?
    3. L - Latency: What are the performance requirements and bottlenecks?
    4. I - Incidents: How do we detect, respond to, and learn from failures?
    5. C - Capacity: Do we have enough resources to handle expected load?

    What Makes a Great SRE

    ✓ Top SREs Demonstrate:

    • Systems thinking and a holistic view of reliability
    • Data-driven decision making with metrics
    • Blameless postmortem culture
    • Balance between reliability and feature velocity
    • Strong automation and scripting capabilities
    • Excellent communication during incidents

    ❌ Common SRE Mistakes:

    • Focusing on uptime instead of user experience
    • Setting unrealistic SLOs (five 9s for everything)
    • Reactive firefighting instead of proactive engineering
    • Poor incident communication and documentation
    • Ignoring the business impact of reliability decisions
    • Over-engineering solutions without measuring outcomes

    The best SREs understand that reliability is not an absolute—it's a business decision. They think in terms of trade-offs, measure everything that matters, and build systems that are reliable enough to support the business while allowing teams to move fast. Remember: the goal isn't perfect uptime; it's sustainable, user-focused reliability that enables business growth.