Data Engineer Interview Questions: From Associate to Senior (2026)
Comprehensive interview questions for data engineers at all levels. Covers SQL, data pipelines, Spark, data warehousing, and system design.
Introduction
Data engineering interviews vary significantly by level. Associate roles focus on SQL and basic pipeline concepts. Senior roles expect system design, optimization, and architectural thinking. This guide covers questions for both levels, clearly marked by difficulty.
SQL & Data Modeling
1. [Associate] Write a query to find the second-highest salary in each department.
Answer:
```sql
SELECT department_id, salary
FROM (
  SELECT department_id, salary,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
  FROM employees
) ranked
WHERE salary_rank = 2;
```
DENSE_RANK handles ties correctly: if two employees tie for the highest salary, the next distinct salary still gets rank 2. RANK would jump to 3 after a tie at first place, and ROW_NUMBER would arbitrarily number the tied rows, so both can miss the true second-highest salary. (The alias avoids `rank`, which is a reserved word in some dialects.)
2. [Associate] Explain the difference between WHERE and HAVING.
Answer: WHERE filters rows before aggregation. HAVING filters groups after aggregation.
```sql
-- WHERE: filter individual rows
SELECT * FROM orders WHERE amount > 100;
-- HAVING: filter aggregated results
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```
3. [Senior] Design a data model for an e-commerce analytics platform.
Answer: Use dimensional modeling (star schema):
Fact tables:
- fact_orders (order_id, customer_key, product_key, date_key, quantity, revenue)
- fact_page_views (session_id, user_key, page_key, date_key, duration)
Dimension tables:
- dim_customer (customer_key, name, segment, acquisition_date)
- dim_product (product_key, name, category, price, supplier)
- dim_date (date_key, date, month, quarter, year, is_holiday)
Design considerations:
- Slowly Changing Dimensions Type 2 for customer/product history
- Pre-aggregated tables for common queries
- Partitioning on date for large fact tables
4. [Senior] How do you handle slowly changing dimensions?
Answer:
- Type 1: Overwrite old value. Simple but loses history.
- Type 2: Add new row with version/dates. Preserves history, increases storage.
- Type 3: Add column for previous value. Limited history.
- Type 6: Hybrid of 1, 2, 3. Current value + history rows.
Choose based on analytical needs. Type 2 is most common for full history tracking.
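The Type 2 pattern above can be sketched in plain Python. This is a conceptual illustration, not production code; the field names (`valid_from`, `valid_to`, `is_current`) are common conventions, and in a real warehouse this logic would typically be a MERGE statement or a dbt snapshot.

```python
from datetime import date

def scd2_upsert(dimension, key, new_attrs, today):
    """Apply an SCD Type 2 change: close the current row for `key`
    and append a new versioned row carrying `new_attrs`."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dimension  # nothing changed, keep current version
            # Close out the old version
            row["valid_to"] = today
            row["is_current"] = False
    dimension.append({
        "key": key, **new_attrs,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dimension

# A customer moves from the "SMB" to the "Enterprise" segment
dim = [{"key": 1, "segment": "SMB",
        "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True}]
scd2_upsert(dim, 1, {"segment": "Enterprise"}, date(2025, 6, 1))
# dim now holds the closed SMB version plus the current Enterprise row
```

Queries for "current state" filter on `is_current`; point-in-time queries filter on the `valid_from`/`valid_to` range.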
Data Pipelines
5. [Associate] What is ETL vs ELT? When would you use each?
Answer:
- ETL (Extract, Transform, Load): Transform data before loading into destination. Good when destination has limited compute (traditional data warehouses).
- ELT (Extract, Load, Transform): Load raw data, transform in destination. Modern approach with powerful cloud warehouses (Snowflake, BigQuery).
ELT is increasingly preferred because cloud warehouses scale compute cheaply, and keeping raw data in the lake allows flexible re-transformation later.
6. [Associate] How do you handle pipeline failures?
Answer:
- Idempotency: Running twice produces same result (use MERGE, not INSERT)
- Retries: Exponential backoff for transient failures
- Checkpointing: Resume from last successful point
- Alerting: PagerDuty/Slack notifications on failure
- Dead letter queues: Capture failed records for later processing
- Monitoring: Track success rates, latency, data quality
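The retry and idempotency points above can be combined in a small sketch. This is a generic illustration with hypothetical names (`run_with_retries`, `flaky`); real pipelines usually get this behavior from their orchestrator (Airflow `retries`/`retry_exponential_backoff`, for example).

```python
import random
import time

def run_with_retries(task, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky task with exponential backoff and jitter.

    Assumes `task` is idempotent, so re-running after a partial
    failure cannot duplicate data.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure for alerting
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retry storms
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)

# Simulate a task that fails twice with transient errors, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "ok"

result = run_with_retries(flaky, sleep=lambda _: None)  # skip real sleeps in the demo
```

Retries only make the pipeline safe when the task itself is idempotent; otherwise each retry risks writing duplicates, which is why MERGE-style writes pair naturally with this pattern.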
7. [Senior] Design a real-time data pipeline for fraud detection.
Answer:
Ingestion: Kafka for transaction stream, Kafka Connect for source integration
Processing: Flink or Spark Structured Streaming for:
- Feature computation (velocity, deviation from normal)
- ML model scoring (deployed via ONNX or custom UDF)
- Rule-based checks
Storage:
- Redis for feature store (low-latency lookups)
- Kafka for scoring results
- Delta Lake for historical analysis
Latency budget: <100ms end-to-end
Challenges: Late data, exactly-once semantics, model updates without downtime
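The velocity feature mentioned above can be sketched in plain Python with a sliding window. This is a conceptual toy, not Flink/Spark code; in the real design this state would live in the stream processor or the Redis feature store.

```python
from collections import deque

class VelocityFeature:
    """Sliding-window transaction velocity: how many transactions a card
    made in the last `window_seconds`. A sudden spike is a common fraud signal."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def update(self, card_id, ts):
        q = self.events.setdefault(card_id, deque())
        q.append(ts)
        # Evict events strictly older than the window
        while q and q[0] < ts - self.window:
            q.popleft()
        return len(q)  # transactions in the window, including this one

vf = VelocityFeature(window_seconds=60)
vf.update("card-1", 0)    # 1 transaction in window
vf.update("card-1", 30)   # 2 transactions in window
vf.update("card-1", 90)   # the t=0 event has aged out
```

The scoring step would compare this count (and deviation-from-normal features) against the model or rule thresholds within the <100ms budget.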
8. [Senior] How do you ensure data quality in pipelines?
Answer:
Prevention:
- Schema enforcement at ingestion
- Contract testing between producer/consumer
Detection:
- Great Expectations or dbt tests
- Statistical anomaly detection (z-scores on counts)
- Freshness checks (data should arrive by time X)
Response:
- Circuit breakers to stop bad data propagation
- Quarantine tables for invalid records
- Automated alerts and incident response
Observability:
- Data quality dashboards
- SLAs with measurement
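The statistical anomaly detection bullet above can be made concrete with a z-score check on daily row counts. This is a minimal sketch of the idea; tools like Great Expectations or Monte Carlo package much richer versions of the same test.

```python
import statistics

def is_anomalous(history, today_count, threshold=3.0):
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today_count != mean  # constant history: any change is anomalous
    z = abs(today_count - mean) / stdev
    return z > threshold

daily_counts = [1000, 1020, 980, 1010, 995, 1005, 990]
is_anomalous(daily_counts, 1008)   # normal day
is_anomalous(daily_counts, 200)    # pipeline dropped most records
```

A firing check would route the batch to a quarantine table and trip the circuit breaker rather than letting the bad load propagate downstream.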
Big Data & Spark
9. [Associate] Explain the difference between map and flatMap in Spark.
Answer:
- map: 1:1 transformation, one input row produces one output row
- flatMap: 1:N transformation, one input row can produce zero or more output rows
```python
# map: transform each element
rdd.map(lambda x: x * 2) # [1,2,3] -> [2,4,6]
# flatMap: flatten nested results
rdd.flatMap(lambda x: x.split(" ")) # ["hello world"] -> ["hello", "world"]
```
10. [Associate] What is a shuffle in Spark and why is it expensive?
Answer: Shuffle redistributes data across partitions (required for joins, groupBy, repartition). It's expensive because:
- Data written to disk
- Data transferred over network
- Serialization/deserialization overhead
- Causes stage boundaries
Minimize shuffles by: broadcast joins for small tables, proper partitioning, avoiding unnecessary groupBys.
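The reason a broadcast join avoids a shuffle can be seen in a plain-Python sketch: the small table becomes an in-memory hash map that (in Spark) is shipped to every executor, so the large table never moves. This is a conceptual illustration, not Spark code.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Conceptual broadcast join: build a hash map of the small table
    once, then probe it for each row of the large table. No
    repartitioning of the large side is required."""
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:  # inner-join semantics
            joined.append({**row, **match})
    return joined

orders = [{"product_id": 1, "qty": 2}, {"product_id": 2, "qty": 1}]
products = [{"product_id": 1, "name": "widget"}, {"product_id": 2, "name": "gadget"}]
broadcast_hash_join(orders, products, "product_id")
```

In PySpark the same intent is expressed with `broadcast(small_df)` or the `/*+ BROADCAST */` hint, and only works when the small side fits in each executor's memory.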
11. [Senior] How do you optimize a slow Spark job?
Answer:
- Check Spark UI: Identify longest stages, shuffle sizes, task skew
- Data skew: Salting keys, broadcast join, adaptive query execution
- Partitioning: Right number of partitions (2-3x cores), partition by join/filter keys
- Serialization: Use Kryo, avoid UDFs when possible
- Memory: Tune spark.memory.fraction, check for spills
- Joins: Broadcast hint for small tables, sort-merge for large
- Caching: Cache frequently accessed data (storage level appropriate to size)
- Predicate pushdown: Filter early, use columnar formats
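Key salting, mentioned under data skew, is worth being able to whiteboard. The sketch below shows the core trick in plain Python: append a random salt to the hot key on the large side, and replicate each small-side row once per salt so every salted key still finds its match. Function names here are illustrative, not a library API.

```python
import random

def salt_key(key, num_salts=8):
    """Large side: append a random salt so one hot key hashes
    to several partitions instead of one."""
    return f"{key}#{random.randrange(num_salts)}"

def explode_small_side(row, key, num_salts=8):
    """Small side: emit one copy of the row per salt value, so a join on
    the salted key still matches every salted large-side row."""
    return [{**row, "salted_key": f"{row[key]}#{s}"} for s in range(num_salts)]

# One small-side product row becomes num_salts join candidates
explode_small_side({"product_id": 7, "name": "widget"}, "product_id", num_salts=4)
```

The trade-off is deliberate: the small side grows by a factor of `num_salts` in exchange for spreading the hot key's work across that many tasks. Spark 3's adaptive query execution can often handle skewed joins automatically, making manual salting a fallback.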
12. [Senior] Explain Spark's execution model and memory management.
Answer:
Execution:
- Driver creates DAG of stages
- Tasks are units of work (one per partition per stage)
- Executors run tasks in parallel
Memory (unified memory model):
- Execution: Shuffles, joins, sorts, aggregations
- Storage: Cached RDDs/DataFrames
- Dynamic allocation between execution and storage
Spilling: When memory is insufficient, data spills to disk, significantly slowing operations.
Key configs: spark.executor.memory, spark.memory.fraction, spark.sql.shuffle.partitions
Data Warehousing
13. [Associate] What is the difference between OLTP and OLAP?
Answer:
- OLTP (Online Transaction Processing): Operational databases. Many small transactions, normalized schemas, row-oriented. Example: PostgreSQL for app backend.
- OLAP (Online Analytical Processing): Analytics databases. Complex queries on large datasets, denormalized schemas, columnar storage. Example: Snowflake, BigQuery.
14. [Senior] Compare Snowflake, BigQuery, and Redshift.
Answer:
| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Storage/Compute | Separated | Separated | Together (RA3 separated) |
| Scaling | Automatic | Automatic | Manual clusters |
| Pricing | Per-second compute | Per-query bytes scanned | Per-hour node |
| Best for | Multi-cloud, varied workloads | Google ecosystem, serverless | AWS-heavy, predictable workloads |
Choose Snowflake for multi-cloud flexibility, BigQuery for serverless simplicity, Redshift for deep AWS integration.
15. [Senior] How do you design a data lakehouse architecture?
Answer:
Storage: S3/ADLS with Delta Lake, Iceberg, or Hudi format
Benefits: ACID transactions, time travel, schema evolution on data lake
Layers:
- Bronze: Raw data as ingested
- Silver: Cleaned, conformed, deduplicated
- Gold: Business-level aggregates, feature stores
Compute: Spark/Databricks for processing, Trino/Presto for ad-hoc queries
Governance: Unity Catalog or AWS Lake Formation for access control
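The bronze-to-silver step above is mostly validation and deduplication. A minimal pure-Python sketch of the idea (in practice this would be a Spark job writing Delta/Iceberg tables, with malformed rows routed to a quarantine table):

```python
def bronze_to_silver(raw_events):
    """Silver-layer cleanup sketch: drop malformed rows and deduplicate,
    keeping the latest version of each event_id by timestamp."""
    latest = {}
    for event in raw_events:
        if "event_id" not in event or "ts" not in event:
            continue  # real pipelines quarantine these instead of dropping
        current = latest.get(event["event_id"])
        if current is None or event["ts"] > current["ts"]:
            latest[event["event_id"]] = event
    return sorted(latest.values(), key=lambda e: e["event_id"])

bronze = [
    {"event_id": 1, "ts": 10, "value": "a"},
    {"event_id": 1, "ts": 12, "value": "a-corrected"},  # later duplicate wins
    {"event_id": 2, "ts": 11, "value": "b"},
    {"ts": 13, "value": "orphan"},  # malformed: missing event_id
]
silver = bronze_to_silver(bronze)
```

Gold tables then aggregate silver data into business-level metrics, so each layer has a single, auditable responsibility.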
System Design
16. [Senior] Design a data platform for a company with 10 billion events per day.
Answer:
Ingestion:
- Kafka (MSK or Confluent) with multiple partitions
- ~115K events/second sustained, plan for 3x peaks
- Multiple consumer groups for different use cases
Streaming layer:
- Flink for real-time transformations and aggregations
- Output to Kafka topics for downstream consumers
Batch layer:
- Spark on EMR/Databricks for historical reprocessing
- Delta Lake for storage (compaction, Z-ordering)
Serving:
- Snowflake/BigQuery for SQL analytics
- Druid/ClickHouse for real-time dashboards
- Redis for feature serving
Data quality: Great Expectations, Monte Carlo
Orchestration: Airflow/Dagster
Cost optimization: Spot instances, auto-scaling, data lifecycle policies
17. [Senior] How do you handle late-arriving data in streaming systems?
Answer:
- Watermarks: Define how late data can arrive (e.g., 10 minutes)
- Windowing: Tumbling, sliding, or session windows with allowed lateness
- State management: Keep state long enough for late data
- Reprocessing: Lambda architecture or Kappa with compacted topics
- Business rules: Define what "late" means for your use case
Trade-off: Longer lateness allowance = more complete data but higher latency and state.
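The watermark mechanics can be shown with a toy in plain Python: the watermark trails the maximum event time seen by the allowed lateness, and anything older is rejected. This is a conceptual sketch, not Flink or Spark code, where the engine maintains this state per partition.

```python
def process_events(events, allowed_lateness):
    """Toy watermark: trails max event time by `allowed_lateness`.
    Events older than the watermark are dropped (a real system might
    route them to a side output instead)."""
    max_event_time = 0
    accepted, dropped = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness
        if ts >= watermark:
            accepted.append((ts, payload))
        else:
            dropped.append((ts, payload))
    return accepted, dropped

# Event times are out of order; lateness allowance is 10 time units
stream = [(100, "a"), (105, "b"), (96, "late-ok"), (80, "too-late")]
accepted, dropped = process_events(stream, allowed_lateness=10)
```

Raising `allowed_lateness` would admit the last event too, at the cost of holding window state open longer, which is exactly the completeness-versus-latency trade-off described above.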
Behavioral Questions
18. [All levels] Tell me about a data pipeline you built that you're proud of.
Answer: Use STAR format emphasizing:
- Scale: Data volumes processed
- Impact: Business value delivered
- Challenges: Technical problems solved
- Decisions: Trade-offs you made and why
19. [Senior] How do you prioritize data requests from multiple stakeholders?
Answer:
- Understand impact: Revenue, users affected, strategic importance
- Quick wins: Some requests take 30 minutes but add immediate value
- Batch similar work: Group related requests
- Self-service: Build tools so stakeholders can answer questions themselves
- Communication: Set expectations, provide estimates, update on progress
- Documentation: Capture requirements to avoid rework
20. [Senior] How do you ensure data security and compliance?
Answer:
- Access control: Role-based, column-level security, row-level security
- Encryption: At rest (managed keys), in transit (TLS)
- PII handling: Masking, tokenization, separate processing pipelines
- Audit logging: Who accessed what when
- Data retention: Automated deletion per policy
- Compliance: GDPR right to deletion, SOC2 controls
Additional Resources
- Databricks Academy - Official Databricks training
- Snowflake University - Snowflake learning paths
- Mode SQL Tutorial - Comprehensive SQL practice
- Data Architect Career Path - Career progression guide
- Designing Data-Intensive Applications - Essential reading for senior roles
Conclusion
Data engineering interviews assess both technical depth and practical experience. Associate-level candidates should nail SQL and understand pipeline basics. Senior candidates need system design skills and the ability to discuss trade-offs at scale.
Certifications for Data Engineers
Stand out in your interview with industry-recognized credentials:
For Associate Level:
- Databricks Data Engineer Associate - Lakehouse fundamentals, 1,200+ questions
- Snowflake SnowPro Core - Data warehouse essentials
- AWS Data Engineer Associate - AWS data services mastery
For Senior Level:
- Databricks Data Engineer Professional - Advanced lakehouse architecture
- AWS Solutions Architect Associate - Broaden your system design skills
BetaStudy's practice questions include real-world scenarios and detailed explanations to build interview confidence.
Start your data engineering journey with a free trial.