Data Engineer Interview Questions Guide

Data engineer interview questions are all over the place. Some companies ask almost no SQL and go heavy on systems design. Others will put a 400-line Spark job in front of you and ask what’s wrong. I’ve seen candidates who know Parquet vs ORC cold get tripped up because they couldn’t explain why their idempotent pipeline design was actually not idempotent.

At LastRound AI we watch candidates go through live data engineering rounds with our copilot running in the background – not a simulated environment, actual interviews. The most common gap isn’t technical knowledge. It’s the inability to reason out loud about trade-offs when the interviewer pushes back. Candidates freeze when asked “why not just use ELT here?” if their prepared answer assumes ETL.

The questions below are organized by what they’re actually testing, not by tool. SQL and Spark questions appear together when they probe the same underlying concept. I’ve cut anything that’s purely a vocabulary quiz with no reasoning component.

What data engineering interviews actually test

Schema design judgment: when to normalize, when not to, and why denormalization in a warehouse is different from denormalization in Postgres
Pipeline reliability: idempotency, CDC, failure handling, backfill behavior
Distributed systems trade-offs: Spark transformations, shuffle cost, skew, broadcast thresholds
SQL depth: window functions, recursive CTEs, query plan reasoning
Streaming vs batch: when each is appropriate and where exactly-once actually matters

Context worth knowing: the BLS Occupational Outlook Handbook puts the median annual wage for database administrators at $104,620 as of May 2024, with the database architects category higher at $135,980 – and the broader data/analytics engineering field is seeing 33%+ growth projections through 2034 per BLS data. The job market is real. So is the competition. (BLS source)

Schema design and SQL

These questions come first in most interviews because they’re cheap to ask and they separate people who’ve operated real warehouses from people who’ve read about them. If you get a star schema question, don’t just define it – say when you’d pick it over a normalized approach and why.

Star schema vs snowflake schema: explain both and say which you’d pick.

Star schema puts a fact table at the center with denormalized dimension tables hanging directly off it. Queries are simple, and for analytics workloads that’s usually worth the extra storage cost.

Snowflake normalizes those dimensions into sub-tables. Less redundancy, more joins. On usage-billed warehouses such as BigQuery (on-demand pricing bills by bytes scanned) or Snowflake the warehouse (bills by compute time), those extra joins add up fast.

In practice, I pick star schema for most analytics work. The performance wins from fewer joins outweigh the storage overhead, especially when your dimension tables are small. Snowflake schema makes sense if your dimensions have genuinely large, reusable hierarchies – geography being the textbook case.

What are slowly changing dimensions? Walk through Type 1, 2, and 3.

Type 1: Overwrite. No history. Fine when the old value genuinely doesn’t matter – fixing a typo in a customer name, for example.

Type 2: New row per change, with effective dates and an is_current flag. Full history preserved. This is what most teams actually need for anything customer-facing.

Type 3: Add a “previous_value” column. You get one step of history. I’ve never seen this used in production outside of a textbook – in real life you either want no history (Type 1) or full history (Type 2).

The follow-up question is almost always: “How do you handle Type 2 when upstream systems don’t have a reliable updated_at?” Know your answer. (Hint: CDC via Debezium reading the transaction log is cleaner than polling timestamps.)
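
To make the Type 2 mechanics concrete, here is a minimal in-memory sketch. The `CustomerRow` shape, the `city` attribute, and the `apply_scd2` helper are all hypothetical names chosen for illustration; a real warehouse would do this with a MERGE statement rather than Python lists.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date]  # None means the row is still open
    is_current: bool


def apply_scd2(history: List[CustomerRow], customer_id: int,
               new_city: str, change_date: date) -> List[CustomerRow]:
    """Close the current row and append a new version if the tracked value changed."""
    current = next(
        (r for r in history if r.customer_id == customer_id and r.is_current),
        None,
    )
    if current is not None and current.city == new_city:
        return history  # nothing changed, so a replay adds no new version
    if current is not None:
        current.valid_to = change_date
        current.is_current = False
    history.append(CustomerRow(customer_id, new_city, change_date, None, True))
    return history


history: List[CustomerRow] = []
apply_scd2(history, 1, "Austin", date(2024, 1, 1))
apply_scd2(history, 1, "Denver", date(2024, 6, 1))
apply_scd2(history, 1, "Denver", date(2024, 6, 1))  # replayed change: no new row
```

After the three calls the history holds two rows: a closed Austin row and an open Denver row. The early return on an unchanged value is what keeps a re-run from creating phantom versions.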

Write a query to find duplicate records in a table.

-- Simple: shows the email and how many duplicates
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Better: shows the actual duplicate rows with their IDs
SELECT *
FROM (
  SELECT *,
    COUNT(*) OVER (PARTITION BY email) AS cnt
  FROM users
) t
WHERE cnt > 1
ORDER BY email;

The second version is what you want for debugging because you can see the actual rows and compare them. If the interviewer asks you to delete duplicates keeping only the latest, add ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) and keep rows where that equals 1.
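
One way to try the “keep only the latest” variant end to end is with Python’s built-in sqlite3 module. This is a sketch under assumptions: the sample rows are made up, the `id` tiebreaker is my addition, and window functions need SQLite 3.25 or newer. The DELETE syntax may need adjusting for your warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
    INSERT INTO users (id, email, created_at) VALUES
        (1, 'a@example.com', '2024-01-01'),
        (2, 'a@example.com', '2024-03-01'),
        (3, 'b@example.com', '2024-02-01');
""")

# Number rows within each email, newest first, then delete everything past row 1.
# The id tiebreaker keeps the result deterministic when created_at values match.
conn.execute("""
    DELETE FROM users
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY created_at DESC, id DESC
                   ) AS rn
            FROM users
        ) AS ranked
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
```

Only ids 2 and 3 survive: the newer of the two a@example.com rows, plus the single b@example.com row.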

ROW_NUMBER vs RANK vs DENSE_RANK. When does the difference matter?

ROW_NUMBER: Always unique, always sequential. Use for deduplication – you want exactly one row to get rank 1.

RANK: Ties get the same number, but the next rank skips. So two rows tied at 2 means the next row gets 4. Use when you need to communicate actual competitive rank.

DENSE_RANK: Ties get the same number, no skipping. Two tied at 2 means the next row gets 3. Use when you need consecutive rank values, for example the top 3 revenue tiers of product categories. Keep in mind that with ties, filtering on DENSE_RANK <= 3 can return more than three rows.

The real question is: what does “rank” mean to your downstream consumer? Usually you want DENSE_RANK for business dashboards and ROW_NUMBER for engineering operations like deduplication.
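
If the three functions blur together, it can help to compute them by hand once. The sketch below reimplements the numbering logic in plain Python for a descending sort; `rank_scores` is a hypothetical helper, not a library function.

```python
from typing import List, Tuple


def rank_scores(scores: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (score, row_number, rank, dense_rank) for scores sorted descending."""
    ordered = sorted(scores, reverse=True)
    out = []
    rank = dense = 0
    prev = None
    for row_number, score in enumerate(ordered, start=1):
        if score != prev:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK just moves to the next integer
            prev = score
        out.append((score, row_number, rank, dense))
    return out


ranked = rank_scores([90, 80, 80, 70])
```

For the scores 90, 80, 80, 70 the row numbers are 1, 2, 3, 4, the ranks are 1, 2, 2, 4, and the dense ranks are 1, 2, 2, 3.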

OLTP vs OLAP: what’s the actual difference and why does it matter for schema design?

OLTP systems handle many small, fast read/write operations. Postgres, MySQL. Normalized data to preserve integrity. Row-oriented storage because you’re usually fetching a single row by primary key.

OLAP systems run a small number of big analytical reads. Snowflake, BigQuery, Redshift. Denormalized for fewer joins. Columnar storage because you’re usually aggregating a few columns across millions of rows – you don’t need to read the other columns at all.

Why it matters for schema design: if you’re designing for an OLAP warehouse and you normalize like it’s an OLTP system, you’ll spend a lot of time explaining why your dashboards are slow.

How do you optimize a slow SQL query? Walk through your actual process.

First thing I do is run EXPLAIN ANALYZE (Postgres syntax; note that it actually executes the query) and look for sequential scans on large tables and high-cost sort operations. That tells me where the time is actually going.

Check for missing indexes on join keys and filter columns
Push filters as early as possible – don’t aggregate then filter if you can filter then aggregate
Replace correlated subqueries with CTEs or JOINs
Select only the columns you need, never SELECT *
For large tables, check if partitioning would let the planner skip partitions entirely
Look at data types – joining an integer to a varchar is a type coercion on every row
Check table statistics freshness – ANALYZE if they’re stale

If it’s still slow after all of that, the query is probably fundamentally wrong – usually because the data model requires it to join across too many rows. That’s a schema problem, not a query problem.
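
The “read the plan, add the index, read the plan again” loop is easy to rehearse locally. The sketch below uses SQLite’s EXPLAIN QUERY PLAN as a stand-in for Postgres’s EXPLAIN ANALYZE; the `orders` table, the index name, and the `plan` helper are assumptions for illustration, and the exact plan wording differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT id, amount FROM orders WHERE customer_id = 42"


def plan(sql: str) -> str:
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


plan_before = plan(QUERY)  # a full scan of orders, since no index covers the filter
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)   # now a search that names idx_orders_customer
```

Printing `plan_before` and `plan_after` side by side shows the planner switching from scanning the table to searching the index, which is the same signal you look for in a Postgres plan.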

Write a query for running total and 7-day moving average.

SELECT
  sale_date,
  revenue,
  SUM(revenue) OVER (ORDER BY sale_date) AS running_total,
  AVG(revenue) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7day
FROM daily_sales
ORDER BY sale_date;

One gotcha: if you have gaps in dates, “7 preceding rows” is not the same as “previous 7 calendar days.” If you need calendar-based windows with gaps handled, you’ll want a date spine joined in first.
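
Here is a small sketch of the date-spine idea in plain Python, so the difference between “7 rows” and “7 calendar days” is visible. It assumes a day with no sales row counts as zero revenue, which is a business decision you should confirm; `moving_avg_7day` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def moving_avg_7day(sales: Dict[date, float]) -> List[Tuple[date, float]]:
    """7-calendar-day moving average; days without a sales row count as 0 revenue."""
    start, end = min(sales), max(sales)
    # The date spine: one entry per calendar day, whether or not a sale exists.
    spine = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    out = []
    for day in spine:
        window_start = max(start, day - timedelta(days=6))
        n_days = (day - window_start).days + 1
        total = sum(
            sales.get(window_start + timedelta(days=k), 0.0) for k in range(n_days)
        )
        out.append((day, total / n_days))
    return out


sales = {date(2024, 1, 1): 70.0, date(2024, 1, 10): 140.0}
result = dict(moving_avg_7day(sales))
```

With sales only on Jan 1 and Jan 10, the calendar-based average for Jan 10 is 20.0 (140 spread over seven days). A row-based window over the two existing rows would have reported 105.0 instead.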

What’s the practical difference between DELETE, TRUNCATE, and DROP?

DELETE: Row-by-row removal, fully logged, triggers fire, can have a WHERE clause, can roll back. Slow on large tables.

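TRUNCATE: Removes all rows with minimal logging and no WHERE clause. Much faster. The details vary by engine: MySQL and SQL Server reset identity counters, while Postgres only does so with RESTART IDENTITY. MySQL and Oracle commit implicitly, so there’s no rollback; Postgres and SQL Server let you roll back a TRUNCATE issued inside a transaction.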

DROP: Removes the table itself. No recovery without a backup.

In a data pipeline context: TRUNCATE + reload is a common idempotency pattern. Just know that if your pipeline fails midway through, the table is empty until the reload completes – which may affect downstream queries running concurrently.

Pipeline design and reliability

This is where most mid-level candidates get stuck. They know the tools – Airflow, dbt, Debezium – but they haven’t thought hard about failure modes. The interviewer is checking whether you design for the happy path or for actual production.

ETL vs ELT: when does the distinction actually matter?

ETL transforms before loading – you have clean, structured data by the time it hits the destination. Made sense when destination compute was expensive or constrained.

ELT loads raw data first, transforms in the warehouse. Modern columnar warehouses (Snowflake, BigQuery, Redshift) have so much compute that transforming in-warehouse is often cheaper and faster than transforming upstream.

The real reason to prefer ELT is debuggability. When something breaks downstream, you still have the raw data. With ETL, if your transformation had a bug, the original data is gone – or in a separate staging system you may not have access to. dbt is essentially an ELT framework: load raw, transform in SQL, test the output.

What does “idempotent pipeline” actually mean? How do you make one?

Running the same pipeline twice – same inputs, same time range – produces identical output. No duplicates. No missing data.

In practice:

Use MERGE/UPSERT instead of blind INSERT
Or partition-replace: delete the target partition and load the new data inside a single transaction (or use an atomic partition overwrite), so readers never see a half-loaded partition
Process records by natural key, not by arrival order
Store run metadata and skip re-runs that already completed successfully

A common mistake I see: people use INSERT IGNORE thinking it handles duplicates, but it silently drops records that conflict on any unique key – including legitimate updates to an existing row. MERGE is safer.
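
A quick way to convince yourself a load is idempotent is to run it twice and compare. The sketch below uses SQLite’s upsert syntax (INSERT ... ON CONFLICT DO UPDATE, available from SQLite 3.24) as a stand-in for MERGE; the `daily_revenue` table and `load_batch` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_revenue (sale_date TEXT PRIMARY KEY, revenue REAL)"
)


def load_batch(rows):
    """Upsert by natural key so replaying the same batch leaves the same table state."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            """
            INSERT INTO daily_revenue (sale_date, revenue) VALUES (?, ?)
            ON CONFLICT (sale_date) DO UPDATE SET revenue = excluded.revenue
            """,
            rows,
        )


batch = [("2024-01-01", 100.0), ("2024-01-02", 250.0)]
load_batch(batch)
load_batch(batch)  # simulated retry of the same run
table = conn.execute(
    "SELECT sale_date, revenue FROM daily_revenue ORDER BY sale_date"
).fetchall()
```

The table holds exactly two rows after both runs. A blind INSERT would either have failed on the primary key or, without the key, doubled the data.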

What is CDC? Compare log-based, timestamp-based, and trigger-based approaches.

Change Data Capture captures inserts, updates, and deletes from a source database for incremental processing.

Log-based (Debezium): Reads the database transaction log directly. Captures all changes including deletes. Very little query load on the source system, since it tails the log instead of querying tables. The most reliable approach. The downsides: it requires database-level permissions, and transaction log retention needs to be long enough to cover your pipeline lag.

Timestamp-based: Polls rows WHERE updated_at > last_run. Simple to set up, but misses hard deletes, misses rows where updated_at wasn’t properly set, and polls create load on the source.

Trigger-based: Database triggers write changes to a separate table. Adds overhead to every write on the source. Rarely the right choice for high-write tables.

My default: log-based via Debezium if the source is Postgres or MySQL. Timestamp-based only if I can’t get transaction log access.
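
To see why deletes are the deciding factor, here is a toy consumer that applies change events to a replica. The event shape is a simplified assumption of mine: it borrows Debezium’s `op` codes (`c` create, `u` update, `d` delete) but is not Debezium’s actual envelope, and `apply_change_events` is a hypothetical helper.

```python
from typing import Dict, List


def apply_change_events(target: Dict[int, dict], events: List[dict]) -> Dict[int, dict]:
    """Apply ordered change events to a keyed target table (a dict here)."""
    for event in events:
        key = event["key"]
        if event["op"] in ("c", "u"):   # create or update: take the new row image
            target[key] = event["after"]
        elif event["op"] == "d":        # delete: timestamp polling never sees this
            target.pop(key, None)
    return target


events = [
    {"op": "c", "key": 1, "after": {"email": "a@example.com"}},
    {"op": "u", "key": 1, "after": {"email": "a2@example.com"}},
    {"op": "c", "key": 2, "after": {"email": "b@example.com"}},
    {"op": "d", "key": 2, "after": None},
]
replica = apply_change_events({}, events)
```

The replica ends with only key 1, carrying the updated email. A timestamp-based poll would have left key 2 in the target forever, because a hard-deleted row has no updated_at to find.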

How do you handle failed records without taking down the whole pipeline?

The pattern I use consistently: route bad records to a dead letter queue (a separate table or Kafka topic, not just a log file) rather than failing the whole job.

Capture the original raw record plus the error message and timestamp
Alert when the failure rate exceeds a threshold – 0.1% failing silently is a data quality problem you’ll discover three weeks later
Build a reprocessing mechanism for the dead letter queue once the upstream issue is fixed
Implement exponential backoff on transient failures (network timeouts, rate limits) before routing to dead letter

If an interviewer asks about this and you don’t mention alerting thresholds, they’ll follow up with “how would you know if 20% of records are silently failing?” Answer that proactively.
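
The pieces above fit together roughly like this. It is a sketch, not a production implementation: `TransientError`, `process_batch`, and the in-memory dead letter list are hypothetical, and the default delay is zero so the example runs instantly.

```python
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or rate limits."""


def process_batch(records: List[dict], handler: Callable[[dict], None],
                  max_retries: int = 3, base_delay: float = 0.0,
                  alert_threshold: float = 0.001) -> Tuple[List[dict], bool]:
    """Run handler on each record; return (dead_letters, should_alert)."""
    dead_letters = []
    for record in records:
        for attempt in range(max_retries + 1):
            try:
                handler(record)
                break
            except TransientError as exc:
                if attempt == max_retries:  # retries exhausted: park the record
                    dead_letters.append(
                        {"raw": record, "error": repr(exc), "ts": time.time()}
                    )
                else:
                    time.sleep(base_delay * (2 ** attempt))  # exponential backoff
            except Exception as exc:  # permanent failure: retrying will not help
                dead_letters.append(
                    {"raw": record, "error": repr(exc), "ts": time.time()}
                )
                break
    failure_rate = len(dead_letters) / len(records) if records else 0.0
    return dead_letters, failure_rate > alert_threshold


def parse_amount(record: dict) -> None:
    float(record["amount"])  # raises ValueError on malformed input


dead, alert = process_batch([{"amount": "10.5"}, {"amount": "oops"}], parse_amount)
```

The good record passes, the malformed one lands in `dead` with its raw payload and error, and `alert` is True because one failure in two records is far above the 0.1% threshold.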

Explain data lineage. Why do teams actually care about it?

Lineage tracks where data came from, what transformations happened to it, and what downstream assets depend on it.

The practical reason teams care: impact analysis. If a source table changes its schema, you need to know what breaks downstream without manually tracing through 80 dbt models. Tools like dbt’s built-in DAG, OpenLineage, and Marquez automate this.

The compliance angle matters too – for GDPR right-to-deletion requests, you need to know every table that touched a user’s data. If you don’t have lineage, you’re doing manual grep through 3 years of SQL files.

What is dbt and where does it fit in the stack?

dbt handles the T in ELT. You write SQL SELECT statements, dbt wraps them into CREATE TABLE or CREATE VIEW statements, handles dependencies between models, runs tests, and generates documentation with lineage graphs.

Where it fits: after ingestion (Fivetran, Airbyte, custom loaders) and before BI tools (Looker, Tableau, Metabase). The raw zone holds the source data, dbt transforms it into staging and mart layers.

What dbt doesn’t do: it’s not an orchestrator. You still need Airflow, Prefect, or Dagster to schedule dbt runs and handle dependencies with non-dbt jobs. Mixing up “dbt does orchestration” is a quick way to raise an eyebrow in an interview.

Batch vs streaming: when does streaming actually justify the complexity?

Batch processes bounded datasets on a schedule. Higher throughput, easier debugging, simpler re-runs. Good for daily reports, ML training, and any use case where 15-minute-old data is fine.

Streaming processes data continuously with low latency. More complex, exactly-once semantics are genuinely hard, reprocessing historical data is painful.

My honest opinion: most teams that build streaming pipelines would have been better off with micro-batch (5-10 minute batch intervals). Use cases that need sub-second latency are real – fraud detection, live leaderboards. But “the business wants real-time” often means they want data that’s less than 10 minutes old, which Airflow and a smaller Redshift cluster handle fine for a fraction of the operational cost.

How do you build data quality checks into a pipeline?

Schema validation at ingestion (reject records with missing required fields before they enter the warehouse), then row-count and distribution checks after each transformation layer.

Tools that help: Great Expectations for Python pipelines, dbt tests for warehouse transformations (not_null, unique, accepted_values, relationships), and custom SQL checks for business-logic assertions like “revenue can’t be negative.”

The thing most teams skip: freshness checks. A table that stopped updating 6 hours ago is a data quality failure even if all the rows that are there look correct. Set up a check that alerts if max(updated_at) is older than expected.
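
A freshness check is only a comparison, which is part of why it’s odd that teams skip it. In the sketch below, `is_stale` is a hypothetical helper and the one-hour allowance is an assumed SLA; in practice `max_updated_at` would come from a query against the table.

```python
from datetime import datetime, timedelta, timezone


def is_stale(max_updated_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the newest row is older than the allowed lag."""
    return now - max_updated_at > max_lag


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
allowed = timedelta(hours=1)

# Newest row is 30 minutes old: within the allowance.
recent_table_stale = is_stale(datetime(2024, 6, 1, 11, 30, tzinfo=timezone.utc), now, allowed)
# Newest row is 6 hours old: this should page someone.
old_table_stale = is_stale(datetime(2024, 6, 1, 6, 0, tzinfo=timezone.utc), now, allowed)
```

Passing `now` in as an argument, rather than calling the clock inside the function, keeps the check testable and makes backfill runs behave predictably.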

Spark, distributed computing, and storage formats

The Stack Overflow Developer Survey 2024 – 65,000+ respondents – confirmed PostgreSQL leads all database usage at 49%, but for distributed workloads at scale, the Spark questions below are what separate junior from senior data engineers in interviews at companies like Databricks, Stripe, Airbnb, or any team running multi-terabyte pipelines. (Stack Overflow 2024 Survey)

Explain Spark architecture: driver, executors, cluster manager.

Driver: Runs the main program, creates the SparkContext, builds the execution plan, schedules tasks. There’s one driver per application.

Executors: JVM processes on worker nodes that actually run the tasks and cache data. Each task runs in one executor thread.

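Cluster Manager: Allocates resources across applications. YARN on Hadoop clusters, Kubernetes in modern setups, Spark’s built-in Standalone manager for simple dedicated clusters. Local mode, which runs everything in a single JVM, is what you use for testing.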

Follow-up you’ll almost always get: “What happens if the driver dies?” Answer: the whole application fails. This is why long-running Spark streaming jobs checkpoint state to HDFS or S3 so they can restart without replaying from the beginning.

Transformations vs actions in Spark. Why does the distinction matter?

Transformations (map, filter, join, groupBy) are lazy – they build a logical plan but don’t execute. Actions (collect, count, write, show) trigger actual computation.

Why it matters: Spark’s optimizer (Catalyst) can only optimize across transformations before an action fires. If you trigger unnecessary actions in a loop, you lose that optimization and pay the overhead of starting computation multiple times.

Common mistake: calling df.count() inside a loop for logging purposes. Each count is a full job. Cache the dataframe and count once outside the loop, or use Spark’s built-in metrics instead.

What is data skew? How do you fix it?

Skew happens when data is unevenly distributed across partitions. One partition gets 80% of the records because most rows share the same join key – a common case when you join on something like country_code and half your data is “US”.

Standard fixes:

Broadcast join: if the smaller table fits in memory (~10MB by default, tunable), Spark broadcasts it to all executors and avoids the shuffle entirely
Salting: add a random suffix (0-9) to the hot key, explode the dimension table to match, then aggregate and remove the salt after the join
Adaptive Query Execution (Spark 3.0+): at runtime it coalesces small shuffle partitions, converts sort-merge joins to broadcast joins when one side turns out to be small enough, and splits skewed partitions in sort-merge joins into smaller tasks
Separate processing for known hot keys: handle the “US” bucket as a special case and union the result
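
Salting is easier to explain once you’ve watched the key counts change. The sketch below simulates it in plain Python rather than Spark: rows per join key is used as a proxy for shuffle partition size (all rows sharing a key land in the same partition), and the salt is assigned round-robin instead of randomly so the output is reproducible. The data and names are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

NUM_SALTS = 4

# A skewed fact table: 900 "US" events and 100 "CA" events.
facts: List[Tuple[str, int]] = (
    [("US", i) for i in range(900)] + [("CA", i) for i in range(100)]
)
dims: Dict[str, str] = {"US": "United States", "CA": "Canada"}

# Unsalted: one join key carries 900 rows.
plain_rows_per_key = Counter(country for country, _ in facts)

# Salt the fact side by appending a suffix to the key.
salted_facts = [
    (f"{country}_{i % NUM_SALTS}", event_id)
    for i, (country, event_id) in enumerate(facts)
]
# Explode the dimension side: one copy of each row per possible salt value.
salted_dims = {
    f"{country}_{salt}": name
    for country, name in dims.items()
    for salt in range(NUM_SALTS)
}
salted_rows_per_key = Counter(key for key, _ in salted_facts)

# Join on the salted key, then strip the salt to recover the original key.
joined = [
    (key.rsplit("_", 1)[0], event_id, salted_dims[key])
    for key, event_id in salted_facts
]
```

The hottest key drops from 900 rows to 225, at the cost of a dimension table that is four times larger. Every fact row still finds its match, so the join result is unchanged once the salt is stripped.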

Narrow vs wide transformations. Why does this matter for performance?

Narrow transformations (map, filter, union): each input partition contributes to exactly one output partition. No data movement across the network. Fast.

Wide transformations (groupBy, join, distinct, sort): input partitions contribute to multiple output partitions. Requires a shuffle – data gets serialized, sent across the network, deserialized. This is often where the bulk of a Spark job’s runtime goes.

Practical implication: minimize shuffles. Every groupBy, join, and distinct is potentially a shuffle. If you can filter data before a join (push the filter down), you’re reducing the volume that gets shuffled.

Partitioning vs bucketing in Spark and Hive. When do you use which?

Partitioning by a column (usually date) creates a directory per value. Queries that filter on that column skip irrelevant directories entirely – partition pruning. Great for time-series data.

Bucketing distributes rows into a fixed number of files by hash of a column. Tables bucketed on the same key and the same number of buckets can join without a shuffle – the data is co-located. Great for frequently-joined large tables.

They’re not mutually exclusive. A common pattern: partition by date (for freshness queries), bucket by user_id within each partition (for user-level joins).
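
The combined pattern can be pictured as a file layout. This sketch is illustrative only: the path scheme and helper names are hypothetical, and `zlib.crc32` stands in for whatever hash the engine really uses (Spark’s bucketing uses its own hash function and file naming).

```python
import zlib
from typing import List

NUM_BUCKETS = 16


def bucket_for(user_id: str) -> int:
    """Stable bucket number for a user; crc32 is a stand-in hash for illustration."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS


def file_path(table: str, event_date: str, user_id: str) -> str:
    """Directory per date (partitioning), file per hash bucket (bucketing)."""
    return f"{table}/event_date={event_date}/bucket_{bucket_for(user_id):02d}.parquet"


def prune(paths: List[str], event_date: str) -> List[str]:
    """Partition pruning: keep only files under the requested date directory."""
    return [p for p in paths if f"/event_date={event_date}/" in p]


paths = [
    file_path("orders", "2024-06-01", "user-1"),
    file_path("orders", "2024-06-02", "user-1"),
    file_path("clicks", "2024-06-01", "user-1"),
]
june_first = prune(paths, "2024-06-01")
```

Because both tables hash `user-1` with the same function and bucket count, its rows sit in the same bucket number in `orders` and `clicks`, which is what lets a join skip the shuffle. The date filter never opens the June 2 directory at all.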

Storage formats: Parquet, ORC, Avro, Delta Lake. When does each fit?

Parquet: Columnar, compressed, schema embedded. Default choice for analytics. Works everywhere – Spark, Athena, BigQuery, Hive.

ORC: Also columnar, better ACID support for Hive, slightly better compression on some workloads. If your stack is entirely Hive-based, ORC is worth considering. Otherwise Parquet’s ecosystem is wider.

Avro: Row-based, schema stored in the file, excellent for schema evolution. Standard for Kafka messages and event streaming because row-oriented is appropriate when consuming one event at a time.

Delta Lake / Apache Iceberg / Apache Hudi: These are table formats layered on top of data files, usually Parquet. They add ACID transactions, time travel, and schema evolution to otherwise static files; Iceberg also supports partition evolution. Delta originated at Databricks and is most tightly integrated there; Iceberg is vendor-neutral and gaining ground on AWS and GCP.

Data lake vs data warehouse vs lakehouse. Explain the trade-offs.

Data lake (S3, GCS, ADLS): cheap object storage, any format, any schema. Flexible but no governance – query performance varies wildly and there’s no ACID.

Data warehouse (Snowflake, BigQuery, Redshift): structured, curated, fast, governed. Expensive compute. Schema-on-write means data has to conform to a defined table schema at load time.

Lakehouse (Delta Lake, Iceberg, Hudi on top of cloud storage): takes the cheap storage of a data lake and adds warehouse-like features – ACID, time travel, schema enforcement, fast metadata. Increasingly the default architecture for teams that need both ML workloads (raw access to unstructured data) and BI workloads (structured, governed queries).

If a company is starting fresh today, I’d suggest a lakehouse pattern over a traditional warehouse – unless they’re a pure analytics team with no ML needs, in which case Snowflake or BigQuery is simpler to operate.

Exactly-once semantics: what does it mean and when do you actually need it?

Three delivery guarantees: at-most-once (might lose records), at-least-once (might duplicate), exactly-once (neither). In distributed systems, exactly-once across both the processing and the output sink is genuinely hard.

How to achieve it in practice: idempotent writes at the sink (MERGE instead of INSERT), transactional output (Kafka transactions or Delta Lake’s transactional writes), and checkpointing your stream processing state so you can resume from exactly where you left off.

When you actually need it: financial transactions (duplicating a charge is bad), inventory systems, anything where double-counting has real consequences. For analytics pipelines where a dashboard re-calculates on query, at-least-once with idempotent loading is usually good enough and much cheaper to implement.
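
The “at-least-once plus idempotent loading” combination can be shown in a few lines. This is a toy model under assumptions: every event carries a unique id, `IdempotentSink` and `deliver_at_least_once` are hypothetical names, and a dict stands in for a keyed table written with MERGE.

```python
from typing import Dict, Iterable, Tuple


class IdempotentSink:
    """Keeps one row per event_id, so a redelivered event overwrites instead of duplicating."""

    def __init__(self) -> None:
        self.rows: Dict[str, float] = {}

    def write(self, event_id: str, amount: float) -> None:
        self.rows[event_id] = amount

    def total(self) -> float:
        return sum(self.rows.values())


def deliver_at_least_once(events: Iterable[Tuple[str, float]], sink: IdempotentSink) -> None:
    """Push events to the sink; callers may replay events after a failure."""
    for event_id, amount in events:
        sink.write(event_id, amount)


events = [("evt-1", 10.0), ("evt-2", 5.0)]
sink = IdempotentSink()
deliver_at_least_once(events, sink)
deliver_at_least_once(events[1:], sink)  # redelivery after a crash before the offset commit

# What an append-only sink would have recorded for the same deliveries.
naive_total = sum(amount for _, amount in events + events[1:])
```

The keyed sink reports 15.0 while the append-only tally reports 20.0. The delivery guarantee did not change between the two; only the write pattern at the sink did.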

Practice these questions live

Reading answers is not the same as saying them to an interviewer who’s pushing back. We built LastRound AI’s interview copilot to help with exactly this – working through data engineer interview questions in a context that feels closer to the real thing, with a tool that responds to your reasoning rather than just checking keywords.

What interviewers are actually watching for

I’ve seen candidates answer every question technically correctly and still not get an offer. The pattern: they gave textbook answers without ever saying what they’d actually do or why. Interviewers at Airbnb, Meta, and Stripe – three companies where data engineering is a core function – are trying to figure out if you’d be useful on a production incident at 2am, not whether you’ve memorized the SCD types.

Things that raise flags: not mentioning failure modes when asked to design a pipeline; only knowing one tool deeply (it’s fine to prefer Airflow, it’s a problem if you’ve never thought about Prefect or Dagster); not considering cost implications of your design choices; giving abstract answers about scale (“it handles billions of records”) without being able to say how you’d debug a performance problem.

Things that work well: saying “the right answer depends on X and Y” and then actually answering both cases; acknowledging when you’ve made a mistake in your reasoning mid-answer (this reads as intellectual honesty, not weakness); naming specific production incidents you’ve debugged, even vaguely. A candidate who says “we had a skew issue on a user_id join because 40% of our events came from one internal test account – we fixed it by filtering that account ID before the join” is much more credible than someone who lists the theoretical solutions in order.
