SQL & Data Manipulation

Core SQL syntax, advanced techniques, query optimization, and data cleaning with SQL.

Want the Data Engineering version?

Go deeper on incremental loads, deduplication, SCD2, and data quality checks in the Analytical Engineering track:

Advanced SQL (Postgres) for Analytical Engineering →

What Can You Expect?

You can expect SQL coding questions that involve:

  • Writing complex queries, joining tables, aggregating data
  • Using window functions and optimizing query performance
  • Analyzing a large dataset or solving a business problem using SQL

How to Prep

Key SQL Concepts

  • SELECT, FROM, WHERE: Basic query structure.
  • JOINs (INNER, LEFT, RIGHT, FULL): Combining data from multiple tables.
  • GROUP BY and Aggregate Functions (COUNT, SUM, AVG, MIN, MAX): Summarizing data.
  • Window Functions (ROW_NUMBER, RANK, LAG, LEAD): Performing calculations across rows related to the current row.
  • Subqueries and CTEs (Common Table Expressions): Creating reusable query blocks.

SQL Concepts, Explained (With Examples)

1) Filtering: WHERE vs HAVING

  • WHERE filters rows before aggregation.
  • HAVING filters groups after GROUP BY.
-- WHERE filters raw rows
SELECT user_id, SUM(amount) AS total
FROM Orders
WHERE order_date >= '2025-01-01' AND order_date < '2025-02-01'
GROUP BY 1;

-- HAVING filters aggregated groups
SELECT user_id, SUM(amount) AS total
FROM Orders
GROUP BY 1
HAVING SUM(amount) >= 100;

2) JOIN types (what they return)

  • INNER JOIN: only matching keys on both sides.
  • LEFT JOIN: keep all left rows; unmatched right is NULL.
  • FULL OUTER JOIN: keep all rows from both sides (not available everywhere).
-- LEFT JOIN + ON clause filtering (keeps users with zero orders)
SELECT
  u.user_id,
  COUNT(o.order_id) AS orders_in_jan
FROM Users u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
 AND o.order_date >= '2025-01-01'
 AND o.order_date <  '2025-02-01'
GROUP BY 1;

3) Null handling: COALESCE and NULL-safe math

SELECT
  user_id,
  COALESCE(SUM(amount), 0) AS total_spend
FROM Orders
GROUP BY 1;

4) Set logic: UNION ALL vs UNION

  • UNION ALL keeps duplicates (faster, preferred unless you need de-dup).
  • UNION removes duplicates (extra sort/dedup cost).

Interview-Grade SQL: Mental Model

  • Start from grain: What is one row in the final output? (user-day, order, session, etc.)
  • Separate concerns: Use CTEs to isolate filtering, joining, aggregation, and windowing.
  • Be explicit about time: Use half-open intervals (>= start and < next_day/month) to avoid boundary bugs.
  • Know your dialect: Date functions and percentiles differ. This handbook assumes Postgres unless stated otherwise.

Postgres Quick Reference (High-Value in Interviews)

  • Date bucketing: date_trunc('day'|'week'|'month', ts)
  • Half-open time filters: ts >= '2025-01-01' AND ts < '2025-02-01'
  • Conditional aggregation: COUNT(*) FILTER (WHERE ...) or SUM(CASE WHEN ... THEN 1 ELSE 0 END)
  • De-dupe shortcut: DISTINCT ON (key) ... ORDER BY key, updated_at DESC
  • Upsert: INSERT ... ON CONFLICT (key) DO UPDATE
  • NULL-safe comparison: IS DISTINCT FROM

DISTINCT ON example (latest row per user):

SELECT DISTINCT ON (user_id)
  user_id,
  plan,
  updated_at
FROM user_profiles
ORDER BY user_id, updated_at DESC;

Core Patterns You Will Be Asked

  • Conditional aggregation: multiple metrics in one pass using SUM(CASE WHEN ... THEN 1 ELSE 0 END).
  • Top-N per group: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) then filter to rn <= N.
  • De-duplication: keep latest record per entity with ROW_NUMBER() and QUALIFY (if supported) or a wrapping CTE.
  • Retention: cohort on first event, compute period offsets, pivot/aggregate to a retention table.
  • Funnels: step completion counts per user/session, then aggregate conversion between steps.
  • Rolling metrics: 7-day rolling average/sum using window frames.

Conditional aggregation template:

SELECT
  date_trunc('day', event_ts) AS day,
  COUNT(DISTINCT user_id) AS dau,
  SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
  COUNT(DISTINCT CASE WHEN event_name = 'purchase' THEN user_id END) AS purchasing_users
FROM events
WHERE event_ts >= '2025-01-01' AND event_ts < '2025-02-01'
GROUP BY 1
ORDER BY 1;

Window Functions: Practical Recipes

1) Latest record per user (de-dupe):

WITH ranked AS (
  SELECT
    u.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_profiles u
)
SELECT *
FROM ranked
WHERE rn = 1;

2) Top 3 items per store by revenue:

WITH store_item AS (
  SELECT store_id, item_id, SUM(revenue) AS rev
  FROM sales
  GROUP BY 1, 2
), ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY rev DESC) AS rn
  FROM store_item
)
SELECT store_id, item_id, rev
FROM ranked
WHERE rn <= 3
ORDER BY store_id, rn;

3) 7-day rolling DAU (window frame):

WITH daily AS (
  SELECT
    date_trunc('day', event_ts) AS day,
    COUNT(DISTINCT user_id) AS dau
  FROM events
  GROUP BY 1
)
SELECT
  day,
  dau,
  AVG(dau) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7d_avg
FROM daily
ORDER BY day;

Step-by-Step Walkthrough: "Total Spend for Jan 2023 Signups"

Problem: Given Users(user_id, signup_date) and Orders(order_id, user_id, order_date, amount), compute total spend per user for users who signed up in Jan 2023.

Step 1 — Decide the output grain: one row per user_id.

Step 2 — Filter users by signup month: use a half-open interval to avoid end-of-month edge cases.

Step 3 — Join to orders: decide whether you want to keep users with zero orders. In most product analytics, you do.

Step 4 — Aggregate: sum amount per user.

WITH jan_signups AS (
  SELECT user_id
  FROM Users
  WHERE signup_date >= '2023-01-01'
    AND signup_date <  '2023-02-01'
)
SELECT
  u.user_id,
  COALESCE(SUM(o.amount), 0) AS total_spent
FROM jan_signups u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
GROUP BY 1
ORDER BY 1;

Common follow-up questions:

  • If you only want spend within January, add an order_date filter (in the ON clause if keeping zero-order users).
  • If Orders has refunds/chargebacks, define whether to net them out or filter them.

Step-by-Step Walkthrough: Top-N Per Group (Classic Window Question)

Problem: “For each store_id, return the top 2 item_id by revenue.”

Step 1 — Aggregate to the ranking grain: store-item.

Step 2 — Rank within each store: ROW_NUMBER() with a deterministic tie-breaker if needed.

Step 3 — Filter to top N: rn <= 2.

WITH store_item AS (
  SELECT
    store_id,
    item_id,
    SUM(revenue) AS rev
  FROM sales
  GROUP BY 1, 2
), ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY rev DESC, item_id) AS rn
  FROM store_item
)
SELECT store_id, item_id, rev
FROM ranked
WHERE rn <= 2
ORDER BY store_id, rn;

NULLs, Duplicates, and Counting Correctly

  • COUNT(*) vs COUNT(col): COUNT(col) ignores NULLs.
  • LEFT JOIN filter trap: filtering on right-table columns in WHERE turns it into an INNER JOIN; move those conditions into the ON clause when you truly want a LEFT JOIN.
  • Distinct counting with conditions: use COUNT(DISTINCT CASE WHEN ... THEN id END).
  • Many-to-many joins: beware silent row multiplication; aggregate before joining when appropriate.

Query Performance & Readability (What Interviewers Notice)

  • Filter early: reduce data before big joins/aggregations.
  • Project early: select only needed columns in CTEs (especially for wide tables).
  • Use EXPLAIN (if available): talk through join order, scans vs indexes, and cardinality.
  • Prefer deterministic ordering: if you use ROW_NUMBER(), ensure tie-breakers in ORDER BY.
  • Consistent style: one expression per line, uppercase keywords, and clear CTE names.

Example SQL Problem

Given two tables: Users (user_id, signup_date) and Orders (order_id, user_id, order_date, amount), write a query to find the total amount spent by each user who signed up in January 2023.

SELECT
  u.user_id,
  SUM(o.amount) AS total_spent
FROM Users u
JOIN Orders o
  ON u.user_id = o.user_id
WHERE u.signup_date >= '2023-01-01'
  AND u.signup_date <  '2023-02-01'
GROUP BY 1
ORDER BY 1;

Mini Practice Set (With Solutions)

1) Daily new users: For Users(user_id, signup_ts), return signup day and count of users.

SELECT
  date_trunc('day', signup_ts) AS signup_day,
  COUNT(*) AS new_users
FROM Users
GROUP BY 1
ORDER BY 1;

2) Users with no orders (anti-join): For Users and Orders, list users who never ordered.

SELECT u.user_id
FROM Users u
LEFT JOIN Orders o
  ON u.user_id = o.user_id
WHERE o.user_id IS NULL;

3) First purchase date per user:

SELECT
  user_id,
  MIN(order_date) AS first_order_date
FROM Orders
GROUP BY 1;

4) Top 1 most recent order per user:

WITH ranked AS (
  SELECT
    o.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC, order_id DESC) AS rn
  FROM Orders o
)
SELECT *
FROM ranked
WHERE rn = 1;

Test Your Knowledge

🧠 SQL Fundamentals Quiz

Test your understanding of core SQL concepts including JOINs, aggregations, and window functions.

1 Which JOIN type returns all rows from the left table and matching rows from the right table?

2 What is the difference between WHERE and HAVING clauses?

3 Which window function assigns a unique sequential integer to rows within a partition?

4 What does a Common Table Expression (CTE) allow you to do?

5 If you want to find the second highest salary in an employee table, which approach would work?

60 mins Intermediate