System Design Interview 2026: LLM Apps, Vector DBs, and What Changed

If you studied for a system-design round in 2022, you practised designing Twitter, Uber, and a URL shortener. Those still come up. But more than half the senior rounds we see in 2026 include at least one LLM-flavoured question. “Design a RAG-backed support chatbot.” “Design the infrastructure for a coding agent that can run for hours.” “Design a vector search service for 10 billion documents.”

The panels haven’t thrown out the old playbook. They’ve added new pages to it. Here’s what’s actually new.

Question type 1: RAG-backed systems

“Design a customer-support chatbot that answers from our docs.” Variants land at every FAANG and most well-funded startups. The panel is testing whether you can reason about a stack that mixes embedding pipelines, a vector store, an LLM provider, and a feedback loop.

The components they want to see on the board:

An ingestion pipeline. Source docs → chunker → embedder → vector store. Run on doc updates. Idempotent on re-runs. Mention you’d chunk by semantic unit (markdown heading, code block) rather than by character count.
A query path. User query → query embedder → vector search top-k → rerank → context-window assembly → LLM call → response. Streaming response back to the client.
A feedback loop. Thumbs-up / thumbs-down per response, logged with the query, the retrieved chunks, and the LLM output. Used to retrain the reranker and to flag regressions.
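The ingestion points above (chunk by semantic unit, idempotent on re-runs) can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `chunk_markdown`, `ingest`, and the `Chunk` type are hypothetical names, and a plain dict stands in for the vector store. Chunk IDs are derived from the source path plus the chunk text, so re-running ingestion over an unchanged doc adds nothing.

```python
import hashlib
import re
from dataclasses import dataclass

HEADING = re.compile(r"^#{1,6}\s+\S")  # a markdown ATX heading line
FENCE = "`" * 3  # marker that opens or closes a fenced code block


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # derived from source path + content, so re-runs are stable
    text: str


def chunk_markdown(doc_path: str, markdown: str) -> list[Chunk]:
    """Split a markdown document at headings, keeping fenced code blocks whole."""
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Start a new section at a heading, unless we are inside a code fence
        # (where a leading '#' is a comment, not a heading).
        if not in_fence and HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks = []
    for lines in sections:
        text = "\n".join(lines).strip()
        if not text:
            continue
        digest = hashlib.sha256(f"{doc_path}\n{text}".encode("utf-8")).hexdigest()
        chunks.append(Chunk(chunk_id=digest[:16], text=text))
    return chunks


def ingest(store: dict[str, str], doc_path: str, markdown: str) -> int:
    """Upsert chunks into a dict standing in for the vector store.

    Returns the number of chunks that were new (and so would need embedding).
    """
    new = 0
    for chunk in chunk_markdown(doc_path, markdown):
        if chunk.chunk_id not in store:
            store[chunk.chunk_id] = chunk.text  # a real pipeline would embed here
            new += 1
    return new
```

One gap worth naming on the whiteboard: this sketch never deletes. When a doc section is edited or removed, the old chunk stays in the store until something garbage-collects IDs that no longer appear in the source.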

The trade-offs they want you to articulate: embedding model latency vs quality (small open models like bge-small-en run acceptably on CPU; larger models generally win on retrieval recall but cost more per call. A hosted API model like text-embedding-3-large adds a network round trip and per-token billing, and a large self-hosted model usually wants a GPU). Vector store choice (Pinecone managed vs pgvector self-hosted vs Qdrant for hybrid search). Caching: cache the embeddings of common queries so the hot path skips the embed call entirely.
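The query-embedding cache is small enough to sketch. The version below is an assumption-laden illustration: `QueryEmbeddingCache` is a hypothetical name, the `embed` callable stands in for whatever embedding client you use, and the in-process LRU would likely be Redis or similar in a multi-node deployment.

```python
from collections import OrderedDict
from typing import Callable, Sequence

Vector = Sequence[float]


class QueryEmbeddingCache:
    """LRU cache in front of an embedding function."""

    def __init__(self, embed: Callable[[str], Vector], max_entries: int = 10_000):
        self._embed = embed
        self._max_entries = max_entries
        self._entries = OrderedDict()  # normalised query -> vector, oldest first
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalise(query: str) -> str:
        # Collapse case and whitespace so trivially different queries share a key.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Vector:
        key = self._normalise(query)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        vector = self._embed(key)  # embed the normalised text so key and vector agree
        self._entries[key] = vector
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return vector
```

Lower-casing before embedding is a design choice, not a given: for a case-sensitive embedder it changes the vector slightly, so you might key on the normalised text but embed the raw query. Either way, the cache has to be flushed when the embedder version changes.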

Anthropic published a useful breakdown of agentic architectures in their “Building effective agents” post that’s worth reading before any LLM-system-design round. It distinguishes workflows from agents and is roughly the framing panels expect.

Question type 2: vector search at scale

“We have 10 billion document embeddings. Users hit us with 100k queries per second. Design the storage and serving layer.” This one separates candidates who’ve used a vector DB from candidates who’ve operated one.

The numbers matter. A 1536-dim float32 embedding is 6 KB (1536 × 4 bytes). 10B embeddings is roughly 60 TB of raw vectors, before index overhead, metadata, and replicas. You’re not fitting that in RAM on one box. You’re sharding, and the sharding strategy is the interview.
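The back-of-envelope arithmetic is worth being able to reproduce on the spot. The sketch below counts raw vector bytes only; the 64 GiB of usable RAM per node is a hypothetical figure chosen for illustration, not a recommendation.

```python
import math


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_component: int = 4) -> int:
    """Raw vector payload only: excludes index structures, IDs, metadata, replicas."""
    return num_vectors * dims * bytes_per_component


def nodes_needed(total_bytes: int, usable_ram_bytes_per_node: int) -> int:
    """Minimum node count if every vector must sit in RAM, with no replication."""
    return math.ceil(total_bytes / usable_ram_bytes_per_node)


per_vector = raw_vector_bytes(1, 1536)          # 1536 * 4 = 6144 bytes (6 KiB)
total = raw_vector_bytes(10_000_000_000, 1536)  # 61_440_000_000_000 bytes, about 61 TB

# Hypothetical node size: 64 GiB of RAM usable for vectors.
print(nodes_needed(total, 64 * 2**30))  # 895
```

Nearly nine hundred nodes before replication is the number that motivates the disk-backed and hybrid index options.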

Pick the index: HNSW for in-memory low-latency, IVF for disk-backed lower-cost, or hybrid (HNSW per-shard, IVF for routing). HNSW gives you sub-10ms p99 reads but is memory-bound. IVF is slower but fits more vectors per node.
Sharding by document ID hash gives even load but no query locality. Sharding by document namespace (per-tenant, per-language) preserves locality but creates hot shards. There’s no right answer. There’s a workload-shaped answer.
Caching the top-k results for high-frequency queries gets you most of the win for free. Show you’d build this in.
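Hash sharding implies scatter-gather: with no query locality, every query goes to every shard and a coordinator merges the per-shard top-k lists. A minimal sketch of that shape follows. The function names are hypothetical, and the brute-force cosine scan stands in for a real per-shard HNSW or IVF index.

```python
import hashlib
import heapq
import math
from typing import Sequence

Vector = Sequence[float]


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable hash sharding; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_shard(shard: dict[str, Vector], query: Vector, k: int) -> list[tuple[float, str]]:
    """Brute-force top-k on one shard, standing in for an HNSW or IVF index."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search(shards: list[dict[str, Vector]], query: Vector, k: int) -> list[tuple[float, str]]:
    """Scatter the query to every shard, then merge the per-shard top-k lists."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nlargest(k, partials)
```

In a real deployment the per-shard calls run in parallel over the network, so tail latency is set by the slowest shard. Namespace sharding changes `search` to hit only the shards that own the tenant or language, which is where the locality win and the hot-shard risk both come from.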

Reference benchmarks are public. The ann-benchmarks project compares HNSW, IVF, ScaNN, and others across datasets. If you can quote one specific number from it (something of the form “HNSW gets 95% recall at 1k QPS on the GloVe-100 dataset”, with the real figure taken from the current published results), the panel knows you’ve actually engaged with the trade-offs.

Question type 3: long-running agent infrastructure

This is the newest of the three, and the one fewest candidates handle well. “Design the infrastructure for a coding agent that runs for up to 8 hours, executes shell commands, writes files, and resumes from checkpoint if the host dies.” Anthropic, OpenAI, Cognition (Devin), Cursor, and a growing list of startups ask versions of this.

The components:

Per-agent isolated execution environment. One sandbox per session (a Firecracker microVM or a gVisor-sandboxed container). Resource limits enforced. Network policy default-deny except to the model API and an allowlisted set of registries.
Checkpoint store. Agent state (conversation, filesystem snapshot, env variables) serialised to object storage on a cadence (every N tool calls, or every M minutes). On restart, the new container hydrates from the latest checkpoint.
Tool-call routing. The agent makes a tool call. The router validates the call, executes it in the sandbox, returns the result. Latency budget per tool call is tight (most should be under 500ms) because each adds to the wall-clock the user waits.
Observability per-session. Every tool call logged with input, output, latency, and the model’s reasoning tokens. Used for debugging when an agent loops or stalls, and for offline eval.
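The checkpoint cadence and restart path can be sketched as follows. Everything here is illustrative: `CheckpointPolicy`, `save_checkpoint`, and `load_latest` are hypothetical names, a dict stands in for object storage, and the state is assumed to be JSON-serialisable (a real system would store a reference to a filesystem snapshot rather than the files themselves).

```python
import json
import time
from typing import Callable, Optional


class CheckpointPolicy:
    """Decide when to checkpoint: every N tool calls or every M seconds."""

    def __init__(self, every_n_calls: int, every_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._every_n_calls = every_n_calls
        self._every_seconds = every_seconds
        self._clock = clock
        self._calls_since = 0
        self._last_checkpoint = clock()

    def record_tool_call(self) -> bool:
        """Call after each tool call; returns True when a checkpoint is due."""
        self._calls_since += 1
        elapsed = self._clock() - self._last_checkpoint
        return self._calls_since >= self._every_n_calls or elapsed >= self._every_seconds

    def mark_checkpointed(self) -> None:
        self._calls_since = 0
        self._last_checkpoint = self._clock()


def save_checkpoint(store: dict[str, str], session_id: str, seq: int, state: dict) -> str:
    """Write state under a key whose zero-padded sequence number sorts correctly."""
    key = f"{session_id}/{seq:012d}.json"
    store[key] = json.dumps(state, sort_keys=True)
    return key


def load_latest(store: dict[str, str], session_id: str) -> Optional[dict]:
    """Hydrate from the newest checkpoint for a session, if any exists."""
    keys = sorted(k for k in store if k.startswith(f"{session_id}/"))
    return json.loads(store[keys[-1]]) if keys else None
```

The injected `clock` is there so the policy can be tested without waiting. The follow-up question a panel tends to ask is what happens to tool calls made after the last checkpoint: they are either replayed (so they need to be idempotent) or the agent is told they were lost.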

Failure modes to surface: the model API rate-limits mid-session. A tool call hangs. The sandbox runs out of disk. The user cancels. Each needs a defined behaviour.
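As one example of a “defined behaviour”, here is a sketch of handling the rate-limit case with capped exponential backoff and jitter. `RateLimited` is a hypothetical exception standing in for whatever your model client raises on a 429-style response, and the retry limits are placeholder values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a model client raises on a 429-style response."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Retry `call` on RateLimited with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface it so the session can pause
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * jitter())  # wait a random fraction of the current cap
    raise AssertionError("unreachable")
```

When retries are exhausted the exception propagates. In the design, that is the point where the session checkpoints and parks rather than burning the sandbox's wall-clock budget. The other failure modes want the same treatment: a timeout on each tool call, a disk quota with a clear error returned to the model, and a cancel path that checkpoints before teardown.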

What hasn’t changed

The classic questions are still there. Twitter, Uber, URL shortener, Instagram, WhatsApp, payments, chat. They come up in roughly half the senior rounds. The framework for those is unchanged. The minute-by-minute process walkthrough still applies. The latency-numbers cheat sheet still applies.

What changed is that the panel expects you to be able to swap an LLM-flavoured question in mid-round without freezing. If you’ve only practised the 2022 list, the new questions feel impossible. They aren’t. They use the same fundamentals (queues, caches, sharding, async workers) plus 3-4 new components (embedders, vector stores, LLM providers, agent sandboxes). Internalise the new components and you’re ready.

What we hear from candidates in 2026

Across LRAI-coached system design rounds in Q1 2026, the specific mistakes that fail an LLM-flavoured question:

Treating the LLM as a black box with infinite context. Real LLM APIs have token limits, latency variability, and per-call cost. Your design has to acknowledge each.
Forgetting the embedding pipeline. Candidates draw the vector store and the LLM but skip “how does the data get into the vector store, on what cadence, and how do we re-index when the embedder version changes”. This is the operational question panels want to see.
No eval story. The panel will ask “how do you know the system is working”. If your answer is “logs and latency metrics” without mentioning offline eval or user-feedback labels, you’ve missed the LLM-specific signal.
No safety story. PII handling, prompt-injection defence, output filtering. Even one sentence on each moves you up a level.
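For the eval story, it helps to have one concrete offline metric in hand. The sketch below computes recall@k for the retrieval stage over a small labelled set. The function names and the shape of the labelled data (query paired with the set of chunk IDs a human marked relevant) are assumptions for illustration.

```python
from typing import Callable, Iterable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each labelled query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    retrieve: Callable[[str], list[str]],
    labelled: Iterable[tuple[str, set[str]]],
    k: int,
) -> float:
    """Average recall@k over (query, relevant chunk IDs) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in labelled]
    if not scores:
        raise ValueError("labelled set is empty")
    return sum(scores) / len(scores)
```

Run this against a frozen labelled set on every change to the chunker, embedder, or reranker, and a drop in the number becomes the regression flag. It measures retrieval only; answer quality needs its own eval, with thumbs-up / thumbs-down labels as one input.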

A 14-day prep plan if you’re rusty

Two weeks, not “30 days to mastery”. Real reading list:

Days 1-3: Re-read chapters 5, 6, 7 of Kleppmann’s “Designing Data-Intensive Applications”. Foundations don’t change.
Days 4-6: Read Anthropic’s “Building effective agents” and pick one RAG tutorial end-to-end. Build a tiny working RAG over 10 markdown files. The muscle memory is what the round tests.
Days 7-9: Three timed mocks on the classic questions (Twitter, Uber, payments). Use a kitchen timer. 45 minutes each, no pause.
Days 10-12: Three timed mocks on the LLM-flavoured questions (RAG chatbot, vector search, long-running agent).
Days 13-14: Re-watch a recorded mock of yourself if you have one. Look for the silence. The gaps where you stopped narrating are where you lose the panel.

Run LLM-era mock rounds with feedback

LastRound AI runs mock system-design rounds covering both the classic and LLM-era question types, with per-phase timing and real-time prompts when you’ve missed an expected component.
