Data Engineer Interview Questions: From Associate to Senior (2026)
Comprehensive interview questions for data engineers at all levels. Covers SQL, data pipelines, Spark, data warehousing, and system design.
Introduction
Data engineering interviews vary significantly by level. Associate roles focus on SQL and basic pipeline concepts. Senior roles expect system design, optimization, and architectural thinking. This guide covers questions for both levels, clearly marked by difficulty.
SQL & Data Modeling
1. [Associate] Write a query to find the second-highest salary in each department.
Answer:
```sql
SELECT department_id, salary
FROM (
  SELECT department_id, salary,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
  FROM employees
) ranked
WHERE salary_rank = 2;
```
DENSE_RANK handles ties correctly: if two employees tie for the highest salary, the next distinct salary still gets rank 2. RANK would jump to 3 after a tie at first place, and ROW_NUMBER would arbitrarily number the tied rows, so both can miss the true second-highest salary. (The alias avoids `rank`, which is a reserved word in some dialects.)
2. [Associate] Explain the difference between WHERE and HAVING.
Answer: WHERE filters rows before aggregation. HAVING filters groups after aggregation.
```sql
-- WHERE: filter individual rows
SELECT * FROM orders WHERE amount > 100;
-- HAVING: filter aggregated results
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```
3. [Senior] Design a data model for an e-commerce analytics platform.
Answer: Use dimensional modeling (star schema):
Fact tables:
- fact_orders (order_id, customer_key, product_key, date_key, quantity, revenue)
- fact_page_views (session_id, user_key, page_key, date_key, duration)
Dimension tables:
- dim_customer (customer_key, name, segment, acquisition_date)
- dim_product (product_key, name, category, price, supplier)
- dim_date (date_key, date, month, quarter, year, is_holiday)
Design considerations:
- Slowly Changing Dimensions Type 2 for customer/product history
- Pre-aggregated tables for common queries
- Partitioning on date for large fact tables
4. [Senior] How do you handle slowly changing dimensions?
Answer:
- Type 1: Overwrite old value. Simple but loses history.
- Type 2: Add new row with version/dates. Preserves history, increases storage.
- Type 3: Add column for previous value. Limited history.
- Type 6: Hybrid of 1, 2, 3. Current value + history rows.
Choose based on analytical needs. Type 2 is most common for full history tracking.
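The Type 2 pattern above can be sketched in plain Python. This is a conceptual illustration, not production code; the field names (`valid_from`, `valid_to`, `is_current`) are common conventions, and in a real warehouse this logic would typically be a MERGE statement or a dbt snapshot.

```python
from datetime import date

def scd2_upsert(dimension, key, new_attrs, today):
    """Apply an SCD Type 2 change: close the current row for `key`
    and append a new versioned row carrying `new_attrs`."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dimension  # nothing changed, keep current version
            # Close out the old version
            row["valid_to"] = today
            row["is_current"] = False
    dimension.append({
        "key": key, **new_attrs,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dimension

# A customer moves from the "SMB" to the "Enterprise" segment
dim = [{"key": 1, "segment": "SMB",
        "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True}]
scd2_upsert(dim, 1, {"segment": "Enterprise"}, date(2025, 6, 1))
# dim now holds the closed SMB version plus the current Enterprise row
```

Queries for "current state" filter on `is_current`; point-in-time queries filter on the `valid_from`/`valid_to` range.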
Data Pipelines
5. [Associate] What is ETL vs ELT? When would you use each?
Answer:
- ETL (Extract, Transform, Load): Transform data before loading into destination. Good when destination has limited compute (traditional data warehouses).
- ELT (Extract, Load, Transform): Load raw data, transform in destination. Modern approach with powerful cloud warehouses (Snowflake, BigQuery).
ELT is increasingly preferred because cloud warehouses scale compute cheaply, and keeping raw data in the lake allows flexible re-transformation later.
6. [Associate] How do you handle pipeline failures?
Answer:
- Idempotency: Running twice produces same result (use MERGE, not INSERT)
- Retries: Exponential backoff for transient failures
- Checkpointing: Resume from last successful point
- Alerting: PagerDuty/Slack notifications on failure
- Dead letter queues: Capture failed records for later processing
- Monitoring: Track success rates, latency, data quality
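The retry and idempotency points above can be combined in a small sketch. This is a generic illustration with hypothetical names (`run_with_retries`, `flaky`); real pipelines usually get this behavior from their orchestrator (Airflow `retries`/`retry_exponential_backoff`, for example).

```python
import random
import time

def run_with_retries(task, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky task with exponential backoff and jitter.

    Assumes `task` is idempotent, so re-running after a partial
    failure cannot duplicate data.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure for alerting
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retry storms
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)

# Simulate a task that fails twice with transient errors, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "ok"

result = run_with_retries(flaky, sleep=lambda _: None)  # skip real sleeps in the demo
```

Retries only make the pipeline safe when the task itself is idempotent; otherwise each retry risks writing duplicates, which is why MERGE-style writes pair naturally with this pattern.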
7. [Senior] Design a real-time data pipeline for fraud detection.
Answer:
Ingestion: Kafka for transaction stream, Kafka Connect for source integration
Processing: Flink or Spark Structured Streaming for:
- Feature computation (velocity, deviation from normal)
- ML model scoring (deployed via ONNX or custom UDF)
- Rule-based checks
Storage:
- Redis for feature store (low-latency lookups)
- Kafka for scoring results
- Delta Lake for historical analysis
Latency budget: <100ms end-to-end
Challenges: Late data, exactly-once semantics, model updates without downtime
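The velocity feature mentioned above can be sketched in plain Python with a sliding window. This is a conceptual toy, not Flink/Spark code; in the real design this state would live in the stream processor or the Redis feature store.

```python
from collections import deque

class VelocityFeature:
    """Sliding-window transaction velocity: how many transactions a card
    made in the last `window_seconds`. A sudden spike is a common fraud signal."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def update(self, card_id, ts):
        q = self.events.setdefault(card_id, deque())
        q.append(ts)
        # Evict events strictly older than the window
        while q and q[0] < ts - self.window:
            q.popleft()
        return len(q)  # transactions in the window, including this one

vf = VelocityFeature(window_seconds=60)
vf.update("card-1", 0)    # 1 transaction in window
vf.update("card-1", 30)   # 2 transactions in window
vf.update("card-1", 90)   # the t=0 event has aged out
```

The scoring step would compare this count (and deviation-from-normal features) against the model or rule thresholds within the <100ms budget.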
8. [Senior] How do you ensure data quality in pipelines?
Answer:
Prevention:
- Schema enforcement at ingestion
- Contract testing between producer/consumer
Detection:
- Great Expectations or dbt tests
- Statistical anomaly detection (z-scores on counts)
- Freshness checks (data should arrive by time X)
Response:
- Circuit breakers to stop bad data propagation
- Quarantine tables for invalid records
- Automated alerts and incident response
Observability:
- Data quality dashboards
- SLAs with measurement
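The statistical anomaly detection bullet above can be made concrete with a z-score check on daily row counts. This is a minimal sketch of the idea; tools like Great Expectations or Monte Carlo package much richer versions of the same test.

```python
import statistics

def is_anomalous(history, today_count, threshold=3.0):
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today_count != mean  # constant history: any change is anomalous
    z = abs(today_count - mean) / stdev
    return z > threshold

daily_counts = [1000, 1020, 980, 1010, 995, 1005, 990]
is_anomalous(daily_counts, 1008)   # normal day
is_anomalous(daily_counts, 200)    # pipeline dropped most records
```

A firing check would route the batch to a quarantine table and trip the circuit breaker rather than letting the bad load propagate downstream.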
Big Data & Spark
9. [Associate] Explain the difference between map and flatMap in Spark.
Answer:
- map: 1:1 transformation, one input row produces one output row
- flatMap: 1:N transformation, one input row can produce zero or more output rows
```python
# map: transform each element
rdd.map(lambda x: x * 2) # [1,2,3] -> [2,4,6]
# flatMap: flatten nested results
rdd.flatMap(lambda x: x.split(" ")) # ["hello world"] -> ["hello", "world"]
```
10. [Associate] What is a shuffle in Spark and why is it expensive?
Answer: Shuffle redistributes data across partitions (required for joins, groupBy, repartition). It's expensive because:
- Data written to disk
- Data transferred over network
- Serialization/deserialization overhead
- Causes stage boundaries
Minimize shuffles by: broadcast joins for small tables, proper partitioning, avoiding unnecessary groupBys.
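The reason a broadcast join avoids a shuffle can be seen in a plain-Python sketch: the small table becomes an in-memory hash map that (in Spark) is shipped to every executor, so the large table never moves. This is a conceptual illustration, not Spark code.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Conceptual broadcast join: build a hash map of the small table
    once, then probe it for each row of the large table. No
    repartitioning of the large side is required."""
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:  # inner-join semantics
            joined.append({**row, **match})
    return joined

orders = [{"product_id": 1, "qty": 2}, {"product_id": 2, "qty": 1}]
products = [{"product_id": 1, "name": "widget"}, {"product_id": 2, "name": "gadget"}]
broadcast_hash_join(orders, products, "product_id")
```

In PySpark the same intent is expressed with `broadcast(small_df)` or the `/*+ BROADCAST */` hint, and only works when the small side fits in each executor's memory.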
11. [Senior] How do you optimize a slow Spark job?
Answer:
- Check Spark UI: Identify longest stages, shuffle sizes, task skew
- Data skew: Salting keys, broadcast join, adaptive query execution
- Partitioning: Right number of partitions (2-3x cores), partition by join/filter keys
- Serialization: Use Kryo, avoid UDFs when possible
- Memory: Tune spark.memory.fraction, check for spills
- Joins: Broadcast hint for small tables, sort-merge for large
- Caching: Cache frequently accessed data (storage level appropriate to size)
- Predicate pushdown: Filter early, use columnar formats
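Key salting, mentioned under data skew, is worth being able to whiteboard. The sketch below shows the core trick in plain Python: append a random salt to the hot key on the large side, and replicate each small-side row once per salt so every salted key still finds its match. Function names here are illustrative, not a library API.

```python
import random

def salt_key(key, num_salts=8):
    """Large side: append a random salt so one hot key hashes
    to several partitions instead of one."""
    return f"{key}#{random.randrange(num_salts)}"

def explode_small_side(row, key, num_salts=8):
    """Small side: emit one copy of the row per salt value, so a join on
    the salted key still matches every salted large-side row."""
    return [{**row, "salted_key": f"{row[key]}#{s}"} for s in range(num_salts)]

# One small-side product row becomes num_salts join candidates
explode_small_side({"product_id": 7, "name": "widget"}, "product_id", num_salts=4)
```

The trade-off is deliberate: the small side grows by a factor of `num_salts` in exchange for spreading the hot key's work across that many tasks. Spark 3's adaptive query execution can often handle skewed joins automatically, making manual salting a fallback.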
12. [Senior] Explain Spark's execution model and memory management.
Answer:
Execution:
- Driver creates DAG of stages
- Tasks are units of work (one per partition per stage)
- Executors run tasks in parallel
Memory (unified memory model):
- Execution: Shuffles, joins, sorts, aggregations
- Storage: Cached RDDs/DataFrames
- Dynamic allocation between execution and storage
Spilling: When memory is insufficient, data spills to disk, significantly slowing operations.
Key configs: spark.executor.memory, spark.memory.fraction, spark.sql.shuffle.partitions
Data Warehousing
13. [Associate] What is the difference between OLTP and OLAP?
Answer:
- OLTP (Online Transaction Processing): Operational databases. Many small transactions, normalized schemas, row-oriented. Example: PostgreSQL for app backend.
- OLAP (Online Analytical Processing): Analytics databases. Complex queries on large datasets, denormalized schemas, columnar storage. Example: Snowflake, BigQuery.
14. [Senior] Compare Snowflake, BigQuery, and Redshift.
Answer:
| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Storage/Compute | Separated | Separated | Together (RA3 separated) |
| Scaling | Automatic | Automatic | Manual clusters |
| Pricing | Per-second compute | Per-query bytes scanned | Per-hour node |
| Best for | Multi-cloud, varied workloads | Google ecosystem, serverless | AWS-heavy, predictable workloads |
Choose Snowflake for multi-cloud flexibility, BigQuery for serverless simplicity, Redshift for deep AWS integration.
15. [Senior] How do you design a data lakehouse architecture?
Answer:
Storage: S3/ADLS with Delta Lake, Iceberg, or Hudi format
Benefits: ACID transactions, time travel, schema evolution on data lake
Layers:
- Bronze: Raw data as ingested
- Silver: Cleaned, conformed, deduplicated
- Gold: Business-level aggregates, feature stores
Compute: Spark/Databricks for processing, Trino/Presto for ad-hoc queries
Governance: Unity Catalog or AWS Lake Formation for access control
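The bronze-to-silver step above is mostly validation and deduplication. A minimal pure-Python sketch of the idea (in practice this would be a Spark job writing Delta/Iceberg tables, with malformed rows routed to a quarantine table):

```python
def bronze_to_silver(raw_events):
    """Silver-layer cleanup sketch: drop malformed rows and deduplicate,
    keeping the latest version of each event_id by timestamp."""
    latest = {}
    for event in raw_events:
        if "event_id" not in event or "ts" not in event:
            continue  # real pipelines quarantine these instead of dropping
        current = latest.get(event["event_id"])
        if current is None or event["ts"] > current["ts"]:
            latest[event["event_id"]] = event
    return sorted(latest.values(), key=lambda e: e["event_id"])

bronze = [
    {"event_id": 1, "ts": 10, "value": "a"},
    {"event_id": 1, "ts": 12, "value": "a-corrected"},  # later duplicate wins
    {"event_id": 2, "ts": 11, "value": "b"},
    {"ts": 13, "value": "orphan"},  # malformed: missing event_id
]
silver = bronze_to_silver(bronze)
```

Gold tables then aggregate silver data into business-level metrics, so each layer has a single, auditable responsibility.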
System Design
16. [Senior] Design a data platform for a company with 10 billion events per day.
Answer:
Ingestion:
- Kafka (MSK or Confluent) with multiple partitions
- ~115K events/second sustained, plan for 3x peaks
- Multiple consumer groups for different use cases
Streaming layer:
- Flink for real-time transformations and aggregations
- Output to Kafka topics for downstream consumers
Batch layer:
- Spark on EMR/Databricks for historical reprocessing
- Delta Lake for storage (compaction, Z-ordering)
Serving:
- Snowflake/BigQuery for SQL analytics
- Druid/ClickHouse for real-time dashboards
- Redis for feature serving
Data quality: Great Expectations, Monte Carlo
Orchestration: Airflow/Dagster
Cost optimization: Spot instances, auto-scaling, data lifecycle policies
17. [Senior] How do you handle late-arriving data in streaming systems?
Answer:
- Watermarks: Define how late data can arrive (e.g., 10 minutes)
- Windowing: Tumbling, sliding, or session windows with allowed lateness
- State management: Keep state long enough for late data
- Reprocessing: Lambda architecture or Kappa with compacted topics
- Business rules: Define what "late" means for your use case
Trade-off: Longer lateness allowance = more complete data but higher latency and state.
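The watermark mechanics can be shown with a toy in plain Python: the watermark trails the maximum event time seen by the allowed lateness, and anything older is rejected. This is a conceptual sketch, not Flink or Spark code, where the engine maintains this state per partition.

```python
def process_events(events, allowed_lateness):
    """Toy watermark: trails max event time by `allowed_lateness`.
    Events older than the watermark are dropped (a real system might
    route them to a side output instead)."""
    max_event_time = 0
    accepted, dropped = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness
        if ts >= watermark:
            accepted.append((ts, payload))
        else:
            dropped.append((ts, payload))
    return accepted, dropped

# Event times are out of order; lateness allowance is 10 time units
stream = [(100, "a"), (105, "b"), (96, "late-ok"), (80, "too-late")]
accepted, dropped = process_events(stream, allowed_lateness=10)
```

Raising `allowed_lateness` would admit the last event too, at the cost of holding window state open longer, which is exactly the completeness-versus-latency trade-off described above.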
Behavioral Questions
18. [All levels] Tell me about a data pipeline you built that you're proud of.
Answer: Use STAR format emphasizing:
- Scale: Data volumes processed
- Impact: Business value delivered
- Challenges: Technical problems solved
- Decisions: Trade-offs you made and why
19. [Senior] How do you prioritize data requests from multiple stakeholders?
Answer:
- Understand impact: Revenue, users affected, strategic importance
- Quick wins: Some requests take 30 minutes but add immediate value
- Batch similar work: Group related requests
- Self-service: Build tools so stakeholders can answer questions themselves
- Communication: Set expectations, provide estimates, update on progress
- Documentation: Capture requirements to avoid rework
20. [Senior] How do you ensure data security and compliance?
Answer:
- Access control: Role-based, column-level security, row-level security
- Encryption: At rest (managed keys), in transit (TLS)
- PII handling: Masking, tokenization, separate processing pipelines
- Audit logging: Who accessed what when
- Data retention: Automated deletion per policy
- Compliance: GDPR right to deletion, SOC2 controls
Additional Resources
- Databricks Academy - Official Databricks training
- Snowflake University - Snowflake learning paths
- Mode SQL Tutorial - Comprehensive SQL practice
- Data Architect Career Path - Career progression guide
- Designing Data-Intensive Applications - Essential reading for senior roles
Conclusion
Data engineering interviews assess both technical depth and practical experience. Associate-level candidates should nail SQL and understand pipeline basics. Senior candidates need system design skills and the ability to discuss trade-offs at scale.
Certifications for Data Engineers
Stand out in your interview with industry-recognized credentials:
For Associate Level:
- Databricks Data Engineer Associate - Lakehouse fundamentals, 1,200+ questions
- Snowflake SnowPro Core - Data warehouse essentials
- AWS Data Engineer Associate - AWS data services mastery
For Senior Level:
- Databricks Data Engineer Professional - Advanced lakehouse architecture
- AWS Solutions Architect Associate - Broaden your system design skills
BetaStudy's practice questions include real-world scenarios and detailed explanations to build interview confidence.
Start your data engineering journey with a free trial.