Analytical Engineering

Build reliable, versioned, tested analytics datasets (the bridge between Data Engineering and Analytics).

What is Analytical Engineering?

Analytical Engineering focuses on turning raw data into trustworthy analytics tables—clean, documented, and tested—so analysts and data scientists can move fast without re-solving data quality and definition problems in every query.

  • Inputs: raw/staging tables from ETL/ELT
  • Outputs: curated dimensions/facts, metric layers, semantic definitions
  • Tooling patterns: SQL-first transforms, version control, automated tests, CI

Core Deliverables (What You Actually Build)

  • Staging models: typed columns, standardized naming, deduplication, light cleaning.
  • Intermediate models: reusable logic (sessionization, attribution, cohort tables).
  • Marts: business-ready facts/dimensions with clear grains and stable keys.
  • Quality gates: uniqueness, not-null, referential integrity, freshness, volume anomalies.
  • Documentation: source-to-metric lineage and definitions that match stakeholder language.

Step-by-Step Tutorial: From Raw Orders to an Analytics Fact

Goal: create a trustworthy fact_orders in Postgres with deduplication, incremental loading, and checks.

Step 1 — Define the grain and keys

  • Grain: one row per order_id
  • Primary key: order_id
  • Foreign keys: user_id references dim_users(user_id)

Step 2 — Stage raw data (dedupe + type safety)

-- stg_orders: keep latest record per order_id
WITH ranked AS (
  SELECT
    r.*,
    ROW_NUMBER() OVER (
      PARTITION BY r.order_id
      ORDER BY r.updated_at DESC, r.ingested_at DESC
    ) AS rn
  FROM raw_orders r
)
SELECT
  order_id,
  user_id,
  amount::numeric(12,2) AS amount,
  updated_at::timestamptz AS updated_at
FROM ranked
WHERE rn = 1;

Step 3 — Build the fact table (idempotent upsert)

INSERT INTO analytics.fact_orders (order_id, user_id, amount, updated_at)
SELECT order_id, user_id, amount, updated_at
FROM analytics.stg_orders
ON CONFLICT (order_id) DO UPDATE
SET
  user_id = EXCLUDED.user_id,
  amount = EXCLUDED.amount,
  updated_at = EXCLUDED.updated_at
WHERE analytics.fact_orders.updated_at IS DISTINCT FROM EXCLUDED.updated_at;

Step 4 — Add quality checks (run in CI)

-- Primary key uniqueness
  SELECT COUNT(*) AS rows, COUNT(DISTINCT order_id) AS distinct_order_ids
  FROM analytics.fact_orders;

  -- No null keys
  SELECT COUNT(*) AS null_order_id
  FROM analytics.fact_orders
  WHERE order_id IS NULL;

  -- Referential integrity
  SELECT COUNT(*) AS orphan_orders
  FROM analytics.fact_orders o
  LEFT JOIN analytics.dim_users u ON o.user_id = u.user_id
  WHERE o.user_id IS NOT NULL AND u.user_id IS NULL;

Step 5 — Publish definitions

  • What is amount? Gross? Net of refunds? In what currency?
  • What is an “order”? Does it include cancelled orders?
  • What timestamp do we use for “order date”? created vs paid vs shipped?
45 mins Intermediate