SQL & Data Manipulation

Want the Data Engineering version?

Go deeper on incremental loads, deduplication, SCD2, and data quality checks in the Analytical Engineering track:

Advanced SQL (Postgres) for Analytical Engineering →

What Can You Expect?

You can expect SQL coding questions that involve:

Writing complex queries, joining tables, aggregating data
Using window functions and optimizing query performance
Analyzing a large dataset or solving a business problem using SQL

How to Prep

Practice writing SQL queries regularly and focus on efficiency.
Be prepared to explain your code and its logic.
Resources:

Key SQL Concepts

SELECT, FROM, WHERE: Basic query structure.
JOINs (INNER, LEFT, RIGHT, FULL): Combining data from multiple tables.
GROUP BY and Aggregate Functions (COUNT, SUM, AVG, MIN, MAX): Summarizing data.
Window Functions (ROW_NUMBER, RANK, LAG, LEAD): Performing calculations across rows related to the current row.
Subqueries and CTEs (Common Table Expressions): Creating reusable query blocks.

SQL Concepts, Explained (With Examples)

1) Filtering: `WHERE` vs `HAVING`

WHERE filters rows before aggregation.
HAVING filters groups after GROUP BY.

-- WHERE filters raw rows
SELECT user_id, SUM(amount) AS total
FROM Orders
WHERE order_date >= '2025-01-01' AND order_date < '2025-02-01'
GROUP BY 1;

-- HAVING filters aggregated groups
SELECT user_id, SUM(amount) AS total
FROM Orders
GROUP BY 1
HAVING SUM(amount) >= 100;

2) JOIN types (what they return)

INNER JOIN: only matching keys on both sides.
LEFT JOIN: keep all left rows; unmatched right is NULL.
FULL OUTER JOIN: keep all rows from both sides (not available everywhere).

-- LEFT JOIN + ON clause filtering (keeps users with zero orders)
SELECT
  u.user_id,
  COUNT(o.order_id) AS orders_in_jan
FROM Users u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
 AND o.order_date >= '2025-01-01'
 AND o.order_date <  '2025-02-01'
GROUP BY 1;

3) Null handling: `COALESCE` and NULL-safe math

SELECT
  user_id,
  COALESCE(SUM(amount), 0) AS total_spend
FROM Orders
GROUP BY 1;

4) Set logic: `UNION ALL` vs `UNION`

UNION ALL keeps duplicates (faster, preferred unless you need de-dup).
UNION removes duplicates (extra sort/dedup cost).

Interview-Grade SQL: Mental Model

Start from grain: What is one row in the final output? (user-day, order, session, etc.)
Separate concerns: Use CTEs to isolate filtering, joining, aggregation, and windowing.
Be explicit about time: Use half-open intervals (>= start and < next_day/month) to avoid boundary bugs.
Know your dialect: Date functions and percentiles differ. This handbook assumes Postgres unless stated otherwise.

Postgres Quick Reference (High-Value in Interviews)

Date bucketing: date_trunc('day'|'week'|'month', ts)
Half-open time filters: ts >= '2025-01-01' AND ts < '2025-02-01'
Conditional aggregation: COUNT(*) FILTER (WHERE ...) or SUM(CASE WHEN ... THEN 1 ELSE 0 END)
De-dupe shortcut: DISTINCT ON (key) ... ORDER BY key, updated_at DESC
Upsert: INSERT ... ON CONFLICT (key) DO UPDATE
NULL-safe comparison: IS DISTINCT FROM

DISTINCT ON example (latest row per user):

SELECT DISTINCT ON (user_id)
  user_id,
  plan,
  updated_at
FROM user_profiles
ORDER BY user_id, updated_at DESC;

Core Patterns You Will Be Asked

Conditional aggregation: multiple metrics in one pass using SUM(CASE WHEN ... THEN 1 ELSE 0 END).
Top-N per group: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) then filter to rn <= N.
De-duplication: keep latest record per entity with ROW_NUMBER() and QUALIFY (if supported) or a wrapping CTE.
Retention: cohort on first event, compute period offsets, pivot/aggregate to a retention table.
Funnels: step completion counts per user/session, then aggregate conversion between steps.
Rolling metrics: 7-day rolling average/sum using window frames.

Conditional aggregation template:

SELECT
  date_trunc('day', event_ts) AS day,
  COUNT(DISTINCT user_id) AS dau,
  SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
  COUNT(DISTINCT CASE WHEN event_name = 'purchase' THEN user_id END) AS purchasing_users
FROM events
WHERE event_ts >= '2025-01-01' AND event_ts < '2025-02-01'
GROUP BY 1
ORDER BY 1;

Window Functions: Practical Recipes

1) Latest record per user (de-dupe):

WITH ranked AS (
  SELECT
    u.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_profiles u
)
SELECT *
FROM ranked
WHERE rn = 1;

2) Top 3 items per store by revenue:

WITH store_item AS (
  SELECT store_id, item_id, SUM(revenue) AS rev
  FROM sales
  GROUP BY 1, 2
), ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY rev DESC) AS rn
  FROM store_item
)
SELECT store_id, item_id, rev
FROM ranked
WHERE rn <= 3
ORDER BY store_id, rn;

3) 7-day rolling DAU (window frame):

WITH daily AS (
  SELECT
    date_trunc('day', event_ts) AS day,
    COUNT(DISTINCT user_id) AS dau
  FROM events
  GROUP BY 1
)
SELECT
  day,
  dau,
  AVG(dau) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7d_avg
FROM daily
ORDER BY day;

Step-by-Step Walkthrough: "Total Spend for Jan 2023 Signups"

Problem: Given Users(user_id, signup_date) and Orders(order_id, user_id, order_date, amount), compute total spend per user for users who signed up in Jan 2023.

Step 1 — Decide the output grain: one row per user_id.

Step 2 — Filter users by signup month: use a half-open interval to avoid end-of-month edge cases.

Step 3 — Join to orders: decide whether you want to keep users with zero orders. In most product analytics, you do.

Step 4 — Aggregate: sum amount per user.

WITH jan_signups AS (
  SELECT user_id
  FROM Users
  WHERE signup_date >= '2023-01-01'
    AND signup_date <  '2023-02-01'
)
SELECT
  u.user_id,
  COALESCE(SUM(o.amount), 0) AS total_spent
FROM jan_signups u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
GROUP BY 1
ORDER BY 1;

Common follow-up questions:

If you only want spend within January, add an order_date filter (in the ON clause if keeping zero-order users).
If Orders has refunds/chargebacks, define whether to net them out or filter them.

Step-by-Step Walkthrough: Top-N Per Group (Classic Window Question)

Problem: “For each store_id, return the top 2 item_id by revenue.”

Step 1 — Aggregate to the ranking grain: store-item.

Step 2 — Rank within each store: ROW_NUMBER() with a deterministic tie-breaker if needed.

Step 3 — Filter to top N: rn <= 2.

WITH store_item AS (
  SELECT
    store_id,
    item_id,
    SUM(revenue) AS rev
  FROM sales
  GROUP BY 1, 2
), ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY rev DESC, item_id) AS rn
  FROM store_item
)
SELECT store_id, item_id, rev
FROM ranked
WHERE rn <= 2
ORDER BY store_id, rn;

NULLs, Duplicates, and Counting Correctly

COUNT(*) vs COUNT(col): COUNT(col) ignores NULLs.
LEFT JOIN filter trap: filtering on right-table columns in WHERE turns it into an INNER JOIN; move those conditions into the ON clause when you truly want a LEFT JOIN.
Distinct counting with conditions: use COUNT(DISTINCT CASE WHEN ... THEN id END).
Many-to-many joins: beware silent row multiplication; aggregate before joining when appropriate.

Query Performance & Readability (What Interviewers Notice)

Filter early: reduce data before big joins/aggregations.
Project early: select only needed columns in CTEs (especially for wide tables).
Use EXPLAIN (if available): talk through join order, scans vs indexes, and cardinality.
Prefer deterministic ordering: if you use ROW_NUMBER(), ensure tie-breakers in ORDER BY.
Consistent style: one expression per line, uppercase keywords, and clear CTE names.

Example SQL Problem

Given two tables: Users (user_id, signup_date) and Orders (order_id, user_id, order_date, amount), write a query to find the total amount spent by each user who signed up in January 2023.

SELECT
  u.user_id,
  SUM(o.amount) AS total_spent
FROM Users u
JOIN Orders o
  ON u.user_id = o.user_id
WHERE u.signup_date >= '2023-01-01'
  AND u.signup_date <  '2023-02-01'
GROUP BY 1
ORDER BY 1;

Mini Practice Set (With Solutions)

1) Daily new users: For Users(user_id, signup_ts), return signup day and count of users.

SELECT
  date_trunc('day', signup_ts) AS signup_day,
  COUNT(*) AS new_users
FROM Users
GROUP BY 1
ORDER BY 1;

2) Users with no orders (anti-join): For Users and Orders, list users who never ordered.

SELECT u.user_id
FROM Users u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
WHERE o.user_id IS NULL;

3) First purchase date per user:

SELECT
  user_id,
  MIN(order_date) AS first_order_date
FROM Orders
GROUP BY 1;

4) Top 1 most recent order per user:

WITH ranked AS (
  SELECT
    o.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC, order_id DESC) AS rn
  FROM Orders o
)
SELECT *
FROM ranked
WHERE rn = 1;

SQL & Data Manipulation

Want the Data Engineering version?

What Can You Expect?

How to Prep

Key SQL Concepts

SQL Concepts, Explained (With Examples)

1) Filtering: `WHERE` vs `HAVING`

2) JOIN types (what they return)

3) Null handling: `COALESCE` and NULL-safe math

4) Set logic: `UNION ALL` vs `UNION`

Interview-Grade SQL: Mental Model

Postgres Quick Reference (High-Value in Interviews)

Core Patterns You Will Be Asked

Window Functions: Practical Recipes

Step-by-Step Walkthrough: "Total Spend for Jan 2023 Signups"

Step-by-Step Walkthrough: Top-N Per Group (Classic Window Question)

NULLs, Duplicates, and Counting Correctly

Query Performance & Readability (What Interviewers Notice)

Example SQL Problem

Mini Practice Set (With Solutions)

Test Your Knowledge

🧠 SQL Fundamentals Quiz

SQL & Data Manipulation

Want the Data Engineering version?

What Can You Expect?

How to Prep

Key SQL Concepts

SQL Concepts, Explained (With Examples)

1) Filtering: WHERE vs HAVING

2) JOIN types (what they return)

3) Null handling: COALESCE and NULL-safe math

4) Set logic: UNION ALL vs UNION

Interview-Grade SQL: Mental Model

Postgres Quick Reference (High-Value in Interviews)

Core Patterns You Will Be Asked

Window Functions: Practical Recipes

Step-by-Step Walkthrough: "Total Spend for Jan 2023 Signups"

Step-by-Step Walkthrough: Top-N Per Group (Classic Window Question)

NULLs, Duplicates, and Counting Correctly

Query Performance & Readability (What Interviewers Notice)

Example SQL Problem

Mini Practice Set (With Solutions)

Test Your Knowledge

🧠 SQL Fundamentals Quiz

1) Filtering: `WHERE` vs `HAVING`

3) Null handling: `COALESCE` and NULL-safe math

4) Set logic: `UNION ALL` vs `UNION`