Advanced SQL (Postgres) for Analytical Engineering
Production-grade SQL patterns for reliable analytics datasets: layered modeling, idempotent loads, de-duplication, SCD2, QA checks, and performance.
Analytical Engineering Mental Model
- Contract: define grains, keys, and metric definitions.
- Layering:
raw→staging→intermediate→marts(facts/dims). - Idempotent builds: re-running should converge to the same truth.
- Quality gates: tests run every build; failures block deploy.
- Performance: indexes, pre-aggregation, and avoiding join fanout.
1) Postgres Power Tools (Worth Memorizing)
COUNT(*) FILTER (WHERE ...): clean conditional aggregation.DISTINCT ON: fastest readable “latest per key” pattern.INSERT ... ON CONFLICT: practical upsert for idempotent loads.IS DISTINCT FROM: NULL-safe comparisons in updates/joins.generate_series: create date spines for complete time series.EXPLAIN (ANALYZE, BUFFERS): diagnose scans vs indexes and row estimates.
2) Tutorial: Build a Trustworthy Orders Mart (End-to-End)
Scenario: You ingest raw_orders and want a stable analytics.fact_orders used by analysts.
Step 0 — Table conventions
- raw: append-only, may contain duplicates and schema drift.
- staging: typed, standardized names, deduplicated.
- mart: business-ready columns, keys, and constraints.
Step 1 — Staging: latest record wins
-- stg_orders: one row per order_id
CREATE SCHEMA IF NOT EXISTS analytics;
CREATE OR REPLACE VIEW analytics.stg_orders AS
WITH ranked AS (
SELECT
r.*,
ROW_NUMBER() OVER (
PARTITION BY r.order_id
ORDER BY r.updated_at DESC, r.ingested_at DESC
) AS rn
FROM raw_orders r
)
SELECT
order_id,
user_id,
amount::numeric(12,2) AS amount,
status::text AS status,
created_at::timestamptz AS created_at,
updated_at::timestamptz AS updated_at
FROM ranked
WHERE rn = 1;
Step 2 — Mart table: constraints are part of data quality
CREATE TABLE IF NOT EXISTS analytics.fact_orders (
order_id text PRIMARY KEY,
user_id text,
amount numeric(12,2) NOT NULL,
status text NOT NULL,
created_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL
);
CREATE INDEX IF NOT EXISTS fact_orders_user_id_idx ON analytics.fact_orders(user_id);
CREATE INDEX IF NOT EXISTS fact_orders_created_at_idx ON analytics.fact_orders(created_at);
Step 3 — Idempotent upsert load
INSERT INTO analytics.fact_orders (order_id, user_id, amount, status, created_at, updated_at)
SELECT order_id, user_id, amount, status, created_at, updated_at
FROM analytics.stg_orders
ON CONFLICT (order_id) DO UPDATE
SET
user_id = EXCLUDED.user_id,
amount = EXCLUDED.amount,
status = EXCLUDED.status,
created_at = EXCLUDED.created_at,
updated_at = EXCLUDED.updated_at
WHERE analytics.fact_orders.updated_at IS DISTINCT FROM EXCLUDED.updated_at;
Step 4 — Watermarking (incremental)
In Postgres you usually keep the watermark in an orchestration system (Airflow, Dagster) or a control table.
CREATE TABLE IF NOT EXISTS analytics.job_control (
job_name text PRIMARY KEY,
last_success_ts timestamptz NOT NULL
);
-- Example: read watermark
SELECT last_success_ts
FROM analytics.job_control
WHERE job_name = 'orders_mart';
Step 5 — Tests you can run as queries (CI gates)
-- 1) PK uniqueness (should match)
SELECT COUNT(*) AS rows, COUNT(DISTINCT order_id) AS distinct_order_ids
FROM analytics.fact_orders;
-- 2) No negative amounts (example business rule)
SELECT COUNT(*) AS bad_rows
FROM analytics.fact_orders
WHERE amount < 0;
-- 3) Valid statuses (example enum-like rule)
SELECT status, COUNT(*)
FROM analytics.fact_orders
GROUP BY 1
HAVING status NOT IN ('created','paid','shipped','refunded','cancelled');
3) SCD Type 2 in Postgres (Practical Pattern)
Goal: keep history for changing attributes (e.g., user plan).
Table: dim_user_plan(user_id, plan, valid_from, valid_to, is_current)
Step-by-step: identify changes → expire current → insert new current. Wrap in a transaction in real jobs.
-- Find changed users
WITH current_dim AS (
SELECT * FROM analytics.dim_user_plan WHERE is_current = true
), changes AS (
SELECT s.user_id, s.plan
FROM analytics.stg_user_plan s
LEFT JOIN current_dim d
ON s.user_id = d.user_id
WHERE d.user_id IS NULL OR s.plan IS DISTINCT FROM d.plan
)
SELECT * FROM changes;
-- Expire current rows for changed users
UPDATE analytics.dim_user_plan d
SET valid_to = NOW(), is_current = false
WHERE is_current = true
AND EXISTS (
SELECT 1
FROM analytics.stg_user_plan s
WHERE s.user_id = d.user_id
AND s.plan IS DISTINCT FROM d.plan
);
-- Insert new current rows
INSERT INTO analytics.dim_user_plan (user_id, plan, valid_from, valid_to, is_current)
SELECT s.user_id, s.plan, NOW(), NULL, true
FROM analytics.stg_user_plan s
LEFT JOIN analytics.dim_user_plan d
ON s.user_id = d.user_id AND d.is_current = true
WHERE d.user_id IS NULL OR s.plan IS DISTINCT FROM d.plan;
4) Time Series Completeness With a Date Spine
Problem: charts lie when missing days disappear. Use a date spine and LEFT JOIN.
WITH spine AS (
SELECT d::date AS day
FROM generate_series(date '2025-01-01', date '2025-01-31', interval '1 day') AS t(d)
), daily AS (
SELECT date_trunc('day', created_at)::date AS day, COUNT(*) AS orders
FROM analytics.fact_orders
GROUP BY 1
)
SELECT s.day, COALESCE(d.orders, 0) AS orders
FROM spine s
LEFT JOIN daily d USING (day)
ORDER BY s.day;
5) Avoiding Join Fanout (A Silent Analytics Killer)
Rule: if you join facts to a dimension, the dimension side should be unique on the join key.
Detect non-unique dimension keys:
SELECT user_id, COUNT(*)
FROM analytics.dim_users
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC
LIMIT 50;
Fix: dedupe the dimension to one row per key (e.g., latest wins) before joining.
WITH dim_users_latest AS (
SELECT DISTINCT ON (user_id)
user_id,
country,
updated_at
FROM analytics.dim_users
ORDER BY user_id, updated_at DESC
)
SELECT o.order_id, u.country
FROM analytics.fact_orders o
LEFT JOIN dim_users_latest u ON o.user_id = u.user_id;
6) Performance Notes (Postgres)
- Indexes: index join keys and common time filters; validate with
EXPLAIN (ANALYZE). - Large tables: consider partitioning by date for very large facts (range partitions).
- Materialized views: good for expensive aggregates used repeatedly; refresh on schedule.
- Vacuum/analyze: keep stats fresh so the planner chooses good plans.