System Design in 2026: Why the Questions Changed
If you studied for a system-design round in 2022, you practised designing Twitter, Uber, and a URL shortener. Those still come up. But more than half the senior rounds we see in 2026 include at least one LLM-flavoured question. "Design a RAG-backed support chatbot." "Design the infrastructure for a coding agent that can run for hours." "Design a vector search service for 10 billion documents."
The panels haven't thrown out the old playbook. They've added new pages to it. Here's what's actually new.
Question type 1: RAG-backed systems
"Design a customer-support chatbot that answers from our docs." Variants land at every FAANG and most well-funded startups. The panel is testing whether you can reason about a stack that mixes embedding pipelines, a vector store, an LLM provider, and a feedback loop.
The components they want to see on the board:
- An ingestion pipeline. Source docs → chunker → embedder → vector store. Run on doc updates. Idempotent on re-runs. Mention you'd chunk by semantic unit (markdown heading, code block) rather than by character count.
- A query path. User query → query embedder → vector search top-k → rerank → context-window assembly → LLM call → response. Streaming response back to the client.
- A feedback loop. Thumbs-up / thumbs-down per response, logged with the query, the retrieved chunks, and the LLM output. Used to retrain the reranker and to flag regressions.
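A minimal sketch of the chunking step, assuming markdown sources. The function name `chunk_by_heading` and the ID scheme are illustrative, not taken from any particular library. It splits at headings, keeps fenced code blocks intact, and derives chunk IDs from content so a re-run over unchanged docs produces the same IDs.

```python
import hashlib
import re

HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # markdown code-fence marker


def chunk_by_heading(doc_id: str, markdown: str) -> list[dict]:
    """Split a markdown document into one chunk per heading section."""
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Start a new section at each heading, but never inside a code fence,
        # so a "# comment" in a code block does not split the chunk.
        if not in_fence and HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks = []
    for lines in sections:
        text = "\n".join(lines).strip()
        if not text:
            continue
        # Content-derived ID (hypothetical scheme): unchanged text yields the
        # same ID, so an upsert keyed on it is idempotent across re-runs.
        digest = hashlib.sha256(f"{doc_id}\n{text}".encode("utf-8")).hexdigest()[:16]
        chunks.append({"id": f"{doc_id}:{digest}", "text": text})
    return chunks
```

In a real pipeline each chunk would then go to the embedder and be upserted into the vector store under its ID. Stale IDs from edited sections still need a delete pass, which is worth saying out loud in the round.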
The trade-offs they want you to articulate: embedding model latency vs quality (small open models like bge-small-en can run on CPU; larger models generally win on retrieval recall but cost more per call, either as GPU time if you self-host or as network latency and per-token cost for a hosted API model like text-embedding-3-large). Vector store choice (Pinecone managed vs pgvector self-hosted vs Qdrant for hybrid search). Mention caching the embeddings of common queries, which skips the embed call entirely on the hot path.
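The query-embedding cache can be sketched in a few lines. This is an in-process LRU under stated assumptions: the class name `EmbeddingCache`, the normalisation rule, and the `embed` callable are all hypothetical stand-ins for whatever embedder client you would actually use.

```python
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """Small LRU cache in front of an embedding function."""

    def __init__(self, embed: Callable[[str], list[float]], max_entries: int = 10_000):
        self._embed = embed          # assumed: any function mapping text to a vector
        self._max = max_entries
        self._store: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalise(query: str) -> str:
        # Collapse case and whitespace so trivially different queries share a key.
        return " ".join(query.lower().split())

    def get(self, query: str) -> list[float]:
        key = self._normalise(query)
        if key in self._store:
            self._store.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed(query)
        self._store[key] = vector
        if len(self._store) > self._max:
            self._store.popitem(last=False)   # evict the least recently used entry
        return vector
```

In production this would more likely live in a shared cache keyed by embedder version as well as query text, so that a model upgrade does not serve stale vectors.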
Anthropic published a useful breakdown of agentic architectures in their "Building effective agents" post that's worth reading before any LLM-system-design round. It distinguishes workflows from agents and is roughly the framing panels expect.
Question type 2: vector search at scale
"We have 10 billion document embeddings. Users hit us with 100k queries per second. Design the storage and serving layer." This one separates candidates who've used a vector DB from candidates who've operated one.
The numbers matter. A 1536-dim float32 embedding is 1536 × 4 bytes, about 6 KB. 10B embeddings is roughly 60 TB of raw vectors, before any index overhead or replication. You're not fitting that in RAM on one box. You're sharding, and the sharding strategy is the interview.
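The back-of-envelope arithmetic is worth doing on the board. A quick sketch follows; the 256 GiB of usable RAM per node is an assumed figure for illustration only, not a recommendation.

```python
DIMS = 1536
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 10_000_000_000

bytes_per_vector = DIMS * BYTES_PER_FLOAT32   # 6,144 bytes, i.e. 6 KiB
raw_bytes = bytes_per_vector * NUM_VECTORS    # about 61.4 TB of raw vectors


def shards_needed(total_bytes: int, usable_ram_per_node_bytes: int) -> int:
    """Ceiling division: minimum nodes to hold the raw vectors in RAM."""
    return -(-total_bytes // usable_ram_per_node_bytes)


# Assumption for illustration: 256 GiB of RAM per node available for vectors.
ASSUMED_USABLE_RAM = 256 * 1024**3

print(bytes_per_vector, raw_bytes / 1e12, shards_needed(raw_bytes, ASSUMED_USABLE_RAM))
```

Under that assumption the raw vectors alone need a couple of hundred nodes, and graph indexes and replicas push the count higher. That is the point at which quantisation or a disk-backed index enters the conversation.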
- Pick the index: HNSW for in-memory low-latency, IVF for disk-backed lower-cost, or hybrid (HNSW per-shard, IVF for routing). HNSW can give you sub-10ms p99 reads but is memory-bound. IVF is slower but fits more vectors per node.
- Sharding by document ID hash gives even load but no query locality. Sharding by document namespace (per-tenant, per-language) preserves locality but creates hot shards. There's no right answer. There's a workload-shaped answer.
- Caching the top-k results for high-frequency queries gets you most of the win for free. Show you'd build this in.
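To make the hash-sharding option concrete, here is a minimal sketch of the two pieces a query router needs: a stable shard assignment and a scatter-gather merge. The function names and the `(score, doc_id)` tuple shape are assumptions for illustration; the per-shard search itself is left out.

```python
import hashlib
import heapq


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable hash sharding: the same document ID always maps to the same shard."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_top_k(
    per_shard_hits: list[list[tuple[float, str]]], k: int
) -> list[tuple[float, str]]:
    """Merge per-shard (score, doc_id) lists into a global top-k, best score first."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits)
```

Because a hash spreads neighbours across every shard, each query fans out to all of them and the router merges the results. That fan-out is the cost of even load, and it is what namespace sharding avoids when a query is scoped to one tenant.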
Reference benchmarks are public. The ann-benchmarks project compares HNSW, IVF, ScaNN, and others across datasets. If you can quote one specific number from it, in the form "HNSW gets 95% recall at 1k QPS on the GloVe-100 dataset" (check the current results for the real figure before you quote it), the panel knows you've actually engaged with the trade-offs.
Question type 3: long-running agent infrastructure
This is the newest of the three, and the one fewest candidates handle well. "Design the infrastructure for a coding agent that runs for up to 8 hours, executes shell commands, writes files, and resumes from checkpoint if the host dies." Anthropic, OpenAI, Cognition (Devin), Cursor, and a growing list of startups ask versions of this.
The components:
- Per-agent isolated execution environment. One sandbox per session (a Firecracker microVM or a gVisor-sandboxed container). Resource limits enforced. Network policy default-deny except to the model API and a whitelisted set of registries.
- Checkpoint store. Agent state (conversation, filesystem snapshot, env variables) serialised to object storage on a cadence (every N tool calls, or every M minutes). On restart, the new container hydrates from the latest checkpoint.
- Tool-call routing. The agent makes a tool call. The router validates the call, executes it in the sandbox, returns the result. Latency budget per tool call is tight (most should be under 500ms) because each adds to the wall-clock the user waits.
- Observability per-session. Every tool call logged with input, output, latency, and the model's reasoning tokens. Used for debugging when an agent loops or stalls, and for offline eval.
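The checkpoint cadence ("every N tool calls, or every M minutes") can be sketched as follows. This is a simplified illustration: `AgentState` and `Checkpointer` are hypothetical names, a plain dict stands in for object storage, and the filesystem snapshot is left out entirely.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class AgentState:
    session_id: str
    messages: list = field(default_factory=list)
    tool_calls_done: int = 0


class Checkpointer:
    """Saves state every N tool calls or every M seconds, whichever comes first."""

    def __init__(self, store: dict, every_n_calls: int = 10,
                 every_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._store = store            # stand-in for an object-storage client
        self._n = every_n_calls
        self._secs = every_seconds
        self._clock = clock
        self._last_calls = 0
        self._last_time = clock()

    def maybe_save(self, state: AgentState) -> bool:
        due_by_calls = state.tool_calls_done - self._last_calls >= self._n
        due_by_time = self._clock() - self._last_time >= self._secs
        if not (due_by_calls or due_by_time):
            return False
        self._store[state.session_id] = json.dumps(asdict(state))
        self._last_calls = state.tool_calls_done
        self._last_time = self._clock()
        return True

    def load_latest(self, session_id: str) -> Optional[AgentState]:
        """Hydrate a restarted session, or return None if no checkpoint exists."""
        blob = self._store.get(session_id)
        return AgentState(**json.loads(blob)) if blob is not None else None
```

The agent loop would call `maybe_save` after each tool call. On restart, the new sandbox calls `load_latest` and replays from there. Anything done after the last checkpoint is lost, so tool calls with side effects need to be safe to repeat.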
Failure modes to surface: the model API rate-limits mid-session. A tool call hangs. The sandbox runs out of disk. The user cancels. Each needs a defined behaviour.
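"A defined behaviour" for the hanging tool call might look like the sketch below: every outcome maps to an explicit status the agent loop can act on. `ToolResult`, `run_tool`, and the status names are assumptions for illustration. A real router would run the command inside the sandbox rather than on the host.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    status: str   # "ok", "error", or "timeout"
    output: str


def run_tool(argv: list[str], timeout_s: float) -> ToolResult:
    """Run one command and map every outcome to an explicit status."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return ToolResult("timeout", f"no result after {timeout_s}s")
    except OSError as exc:
        # For example: the executable does not exist.
        return ToolResult("error", str(exc))
    if proc.returncode != 0:
        return ToolResult("error", proc.stderr.strip())
    return ToolResult("ok", proc.stdout.strip())


if __name__ == "__main__":
    print(run_tool([sys.executable, "-c", "print('hi')"], timeout_s=5.0))
```

The timeout result goes back to the model as a tool output, so the agent can retry or change approach instead of stalling the session.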
What hasn't changed
The classic questions are still there. Twitter, Uber, URL shortener, Instagram, WhatsApp, payments, chat. They come up in roughly half the senior rounds. The framework for those is unchanged. The minute-by-minute process walkthrough still applies. The latency-numbers cheat sheet still applies.
What changed is that the panel expects you to be able to swap an LLM-flavoured question in mid-round without freezing. If you've only practised the 2022 list, the new questions feel impossible. They aren't. They use the same fundamentals (queues, caches, sharding, async workers) plus four new components (embedders, vector stores, LLM providers, agent sandboxes). Internalise the new components and you're ready.
What we hear from candidates in 2026
Across system-design rounds coached by LastRound AI in Q1 2026, these are the specific mistakes that fail an LLM-flavoured question:
- Treating the LLM as a black box with infinite context. Real LLM APIs have token limits, latency variability, and per-call cost. Your design has to acknowledge each.
- Forgetting the embedding pipeline. Candidates draw the vector store and the LLM but skip "how does the data get into the vector store, on what cadence, and how do we re-index when the embedder version changes". This is the operational question panels want to see.
- No eval story. The panel will ask "how do you know the system is working". If your answer is "logs and latency metrics" without mentioning offline eval or user-feedback labels, you've missed the LLM-specific signal.
- No safety story. PII handling, prompt-injection defence, output filtering. Even one sentence on each moves you up a level.
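For the eval story, one offline metric you could name is retrieval recall@k over a labelled set of queries. A minimal sketch, assuming each example pairs the retrieved chunk IDs with the set of IDs a human marked relevant (both function names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each eval example needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(examples: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not examples:
        raise ValueError("no eval examples")
    return sum(recall_at_k(r, rel, k) for r, rel in examples) / len(examples)
```

Run it on a fixed eval set whenever the chunker, embedder, or reranker changes, and pair it with the thumbs-up and thumbs-down labels from production for the online signal.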
A 14-day prep plan if you're rusty
Two weeks, not "30 days to mastery". Real reading list:
- Days 1-3: Re-read chapters 5, 6, 7 of Kleppmann's "Designing Data-Intensive Applications". Foundations don't change.
- Days 4-6: Read Anthropic's "Building effective agents" and pick one RAG tutorial end-to-end. Build a tiny working RAG over 10 markdown files. The muscle memory is what the round tests.
- Days 7-9: Three timed mocks on the classic questions (Twitter, Uber, payments). Use a kitchen timer. 45 minutes each, no pause.
- Days 10-12: Three timed mocks on the LLM-flavoured questions (RAG chatbot, vector search, long-running agent).
- Days 13-14: Re-watch a recorded mock of yourself if you have one. Look for the silence. The gaps where you stopped narrating are where you lose the panel.
Run LLM-era mock rounds with feedback
LastRound AI runs mock system-design rounds covering both the classic and LLM-era question types, with per-phase timing and real-time prompts when you've missed an expected component.
Written by
Hari
Engineering, LastRound AI
Engineer at LastRound AI. Writes about coding interviews, system design, and the patterns we see when candidates use our copilot for live technical rounds.
Further reading
- NeetCode 150 — Curated DSA practice with video explanations
- System Design Primer — 270k★ open-source system design study guide
- Designing Data-Intensive Applications — Industry-standard distributed-systems text
