🎯 2026 New Year Resolution Challenge: Level Up Your Analytical Skills

Official Cohort Start Dates: January 1, 2026 | April 1, 2026 | September 1, 2026
Duration: 8 weeks
Commitment: ~1 hour/day on weekdays; flexible weekends for catch-up/projects
Goal: Finish the 8 weeks with demonstrably stronger SQL, Python, statistics, AI-augmented analysis, and data storytelling skills.

💡 New for 2026: This challenge now includes an AI-Augmented Analyst Track — learn to leverage GenAI tools as a force multiplier, not a crutch.


🚀 Ready to Transform Your 2026?

Join hundreds of aspiring data professionals in this 8-week challenge.

📝 Sign Up & Introduce Yourself

🚀 Quick Start & Environment Setup

One-Click Environments (Zero Local Setup)

Skip the pip install struggles! Use these browser-based environments:

Open in GitHub Codespaces Open In Colab Try DuckDB in Browser

For SQL Practice (Weeks 1-2):

  • Use DuckDB WASM Shell — runs SQL in your browser, no setup!
  • Load our sample data: CREATE TABLE customers AS SELECT * FROM 'https://raw.githubusercontent.com/moshesham/Data-Science-Analytical-Handbook/main/supplementary/challenge-2026/data/customers.csv';

For Python Practice (Weeks 3–5 and 8):

  • Use Google Colab — free Jupyter notebooks in the cloud
  • Or GitHub Codespaces — full VS Code in browser

Local Setup (Optional)

If you prefer local development:

# Clone the repository
git clone https://github.com/moshesham/Data-Science-Analytical-Handbook.git
cd Data-Science-Analytical-Handbook/supplementary/challenge-2026

# Option 1: pip
pip install -r requirements.txt

# Option 2: conda
conda env create -f environment.yml
conda activate analytics-challenge-2026

Sample Data

We’ve included ready-to-use datasets in data/:

  • customers.csv, orders.csv, products.csv — for SQL practice
  • user_activity.csv — for cohort analysis
  • daily_metrics.csv, monthly_revenue.csv — for time series analysis
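
If you want to poke at these files before Week 1, here is a minimal loading sketch, assuming you have cloned the repo and run it from the challenge-2026 folder so the relative data/ paths resolve:

```python
import duckdb
import pandas as pd

# Query a CSV directly with DuckDB; no database setup needed
con = duckdb.connect()
print(con.execute("SELECT * FROM 'data/customers.csv' LIMIT 5").fetchdf())

# Or load it as a pandas DataFrame for the Python-focused weeks
orders = pd.read_csv('data/orders.csv')
print(orders.head())
```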

🤖 The AI-Augmented Analyst Track (NEW)

In 2026, analysts who can leverage AI tools effectively will have a massive advantage. This track teaches you to use GenAI as a force multiplier — not a replacement for critical thinking.

How It Works

Every week includes an optional AI challenge that teaches responsible AI-augmented analysis:

| Week | Traditional Skill | AI-Augmented Challenge |
|------|-------------------|------------------------|
| 1 | SQL Fundamentals | Prompt Engineering for SQL — Write a complex query, then ask Copilot/ChatGPT to optimize it. Compare execution plans using EXPLAIN ANALYZE. |
| 2 | Window Functions | SQL Review with AI — Generate window function queries with AI, then critique and fix the errors it makes. |
| 3 | Statistics | Synthetic Data Generation — Use Python faker or an LLM to generate a dataset with specific statistical properties. |
| 4 | A/B Testing | Experiment Design Review — Use AI to critique your experiment design doc. Challenge its suggestions. |
| 5 | Pandas/EDA | Refactoring with AI — Take your “spaghetti code” from Day 30 and use an AI agent to modularize it. |
| 6 | Product Metrics | Simulated Stakeholder — Submit your case study answer to ChatGPT playing the role of a skeptical PM. |
| 7 | Analytical Engineering | dbt Model Generation — Use AI to scaffold a dbt project, then customize and validate. |
| 8 | Capstone | AI-Assisted Storytelling — Use AI to critique your presentation structure, then improve it. |

Key Principles for AI-Augmented Analysis

  1. AI as Co-Pilot, Not Autopilot — Always verify AI outputs; it makes confident mistakes.
  2. Prompt Engineering is a Skill — Learn to write precise prompts that constrain the output.
  3. Domain Knowledge Matters More — AI doesn’t know your business context; you do.
  4. Show Your Reasoning — In interviews, explain how you validated AI-generated code.

🏆 Earn Your Challenge Badge

Complete the challenge and showcase your achievement! Add this badge to your GitHub profile README:

Challenge Badges

Participant Badge (Week 1+):

![2026 Analytics Challenge](https://img.shields.io/badge/2026_Analytics_Challenge-Participant-blue?style=for-the-badge&logo=github&logoColor=white)


Halfway Hero (Week 4 Complete):

![2026 Analytics Challenge](https://img.shields.io/badge/2026_Analytics_Challenge-Halfway_Hero-orange?style=for-the-badge&logo=github&logoColor=white)


Challenge Champion (Full Completion):

![2026 Analytics Challenge](https://img.shields.io/badge/2026_Analytics_Challenge-Champion_🏆-gold?style=for-the-badge&logo=github&logoColor=black)


Weekly Skill Badges:

![SQL Mastery](https://img.shields.io/badge/SQL-Mastery-4479A1?style=flat-square&logo=postgresql&logoColor=white)
![Statistics Pro](https://img.shields.io/badge/Statistics-Pro-success?style=flat-square&logo=python&logoColor=white)
![Python Pandas](https://img.shields.io/badge/Pandas-Expert-150458?style=flat-square&logo=pandas&logoColor=white)
![A/B Testing](https://img.shields.io/badge/A%2FB_Testing-Certified-blueviolet?style=flat-square)
![AI-Augmented](https://img.shields.io/badge/AI_Augmented-Analyst-FF6F61?style=flat-square&logo=openai&logoColor=white)

📜 Digital Certificate

Upon completing the capstone (Week 8), you’ll receive a digital certificate to showcase on your LinkedIn profile:

  1. Post your capstone to the Week 8 Discussion thread
  2. Include these components: GitHub repo link, presentation slides/video, key insights
  3. Receive your certificate — we’ll generate a personalized PDF certificate for verified completions

🎓 Certificates include your name, completion date, capstone project title, and a verification link.


💬 Join the Community

Daily Check-Ins

Post your daily progress in our GitHub Discussions! Each day, share:

  • ✅ What you completed
  • 💡 One thing you learned
  • ❓ Questions for the community

Post Your Daily Check-In

Weekly Threads

Every Monday, a new weekly thread goes live. Use it to:

  • Share your mini-project
  • Get feedback on your code
  • Help others who are stuck
  • Celebrate wins! 🎉

| Week | Discussion Thread |
|------|-------------------|
| 1 | SQL Fundamentals Check-In |
| 2 | Window Functions Show & Tell |
| 3 | Stats Concepts Discussion |
| 4 | Experiment Design Review |
| 5 | EDA Notebook Showcase |
| 6 | Case Study Solutions |
| 7 | Pipeline Demos |
| 8 | Capstone Presentations 🎓 |

Help Others, Level Up Together

  • 👍 Upvote helpful answers
  • 💬 Comment with encouragement
  • 🔧 Review a peer’s SQL query or notebook
  • 🏅 Top contributors get featured in the repo!

Why This Challenge Works

| Principle | How It’s Applied |
|-----------|------------------|
| Consistency beats intensity | 1 focused hour daily > weekend cram sessions |
| Learn by doing | Every day has a concrete deliverable |
| Public accountability | Track progress on GitHub/LinkedIn; share wins |
| Build a portfolio | End with 4 polished mini-projects to show |

Challenge Overview

| Week | Theme | Outcome | 🆕 2026 Innovation |
|------|-------|---------|--------------------|
| 1 | SQL Fundamentals Refresher | Solve 15 SQL problems, from easy → medium | Query Cost Optimization with EXPLAIN ANALYZE |
| 2 | SQL Mastery: Window Functions & CTEs | Complete 10 advanced problems; build a cohort/retention query | Visual SQL — see query results as charts |
| 3 | Statistics for Product Analytics | Explain p-values, CI, power; design a sample size calculator | Peeking Simulation — visualize why peeking is bad |
| 4 | A/B Testing & Experimentation | Design & analyze a mock experiment end-to-end | Multi-Armed Bandits — modern alternative to A/B |
| 5 | Python (Pandas & Polars) for Analysis | Clean, explore, and visualize a real dataset | Polars — high-performance DataFrame alternative |
| 6 | Product Metrics & Case Studies | Define metrics frameworks; crack 3 case studies | Metric Trees — hierarchical metric visualization |
| 7 | Analytical Engineering Basics | Build a staging → mart pipeline in SQL | dbt Core — industry-standard transformation tool |
| 8 | Capstone & Storytelling | Deliver a 5-slide insight presentation | Video Walkthrough — 3-min Loom requirement |

Week 1: SQL Fundamentals Refresher

Goal: Knock the rust off core SQL and solve problems quickly.

📂 Sample Data: Use our ready-to-use datasets in challenge-2026/data/ — no downloading required!

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 1 | SELECT, WHERE, ORDER BY | Solve 3 easy SQL problems on DataLemur or LeetCode | DataLemur |
| 2 | JOINs (INNER, LEFT) | Diagram how LEFT JOIN works; solve 3 join problems | SQLBolt |
| 3 | GROUP BY, HAVING | Solve 3 aggregation problems; explain the difference from WHERE | Mode Tutorial |
| 4 | Subqueries | Rewrite a subquery as a CTE; solve 2 problems both ways | LeetCode |
| 5 | CTEs + EXPLAIN ANALYZE | Solve 3 CTE problems; learn to read execution plans | DataLemur |
| 6–7 | Catch-up / Mini-project | Write a single CTE-based query that answers a business question on a public dataset (e.g., Kaggle) | Kaggle Datasets |

🆕 Query Cost Optimization (Distinguishes Senior Analysts)

Learning to read EXPLAIN ANALYZE output early sets you apart. Here’s why it matters:

-- Run this on any query to see its execution plan
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY order_count DESC;

What to look for:

  • Seq Scan vs Index Scan — index scans are usually faster for selective, filtered queries
  • Nested Loop vs Hash Join — hash joins are usually the better choice for joining large tables
  • Actual Time — shows how long each step actually took, not just what the planner estimated
  • Rows — compare estimated vs actual row counts to spot stale table statistics

Pro Tip: In DuckDB (including the browser shell), use EXPLAIN to see the query plan:

EXPLAIN SELECT * FROM orders WHERE amount > 100;
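
If you prefer to stay in Python, here is a minimal sketch of the same idea using the DuckDB Python package against the sample orders.csv (the amount column is assumed from the sample data above):

```python
import duckdb

con = duckdb.connect()
# EXPLAIN returns the plan as rows of (key, value); print it to inspect scans and joins
plan = con.execute(
    "EXPLAIN SELECT * FROM 'data/orders.csv' WHERE amount > 100"
).fetchall()
for key, value in plan:
    print(key)
    print(value)
```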

🤖 AI-Augmented Challenge (Week 1)

Prompt Engineering for SQL:

  1. Write a complex query with 3+ JOINs and aggregations
  2. Ask ChatGPT/Copilot: “Optimize this query for performance. Explain what you changed and why.”
  3. Compare the AI’s version to yours using EXPLAIN ANALYZE
  4. Reflection: Did the AI improve performance? Did it break correctness?

🧠 Week 1 Challenge Problems

Challenge 1: Customer Order Analysis **Problem:** Given tables `customers` and `orders`, find customers who placed more than 3 orders in the last 30 days but have never ordered on a weekend. ```sql -- Schema CREATE TABLE customers (customer_id INT, name TEXT, signup_date DATE); CREATE TABLE orders (order_id INT, customer_id INT, order_date DATE, amount DECIMAL); -- Sample Data INSERT INTO customers VALUES (1, 'Alice', '2024-01-15'), (2, 'Bob', '2024-06-01'); INSERT INTO orders VALUES (101, 1, '2025-12-01', 50), (102, 1, '2025-12-05', 30), (103, 1, '2025-12-10', 45), (104, 1, '2025-12-15', 60), (105, 2, '2025-12-07', 100); -- Sunday (weekend) ``` **Your Task:** Write the query before looking at the solution.
✅ Solution ```sql WITH recent_orders AS ( SELECT customer_id, COUNT(*) AS order_count, COUNT(*) FILTER (WHERE EXTRACT(DOW FROM order_date) IN (0, 6)) AS weekend_orders FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days' GROUP BY customer_id ) SELECT c.customer_id, c.name, ro.order_count FROM customers c JOIN recent_orders ro ON c.customer_id = ro.customer_id WHERE ro.order_count > 3 AND ro.weekend_orders = 0; ``` **Key Concepts:** FILTER clause, EXTRACT(DOW), date arithmetic, CTE for readability.
Challenge 2: Revenue Gap Analysis **Problem:** Find all dates in December 2025 where no orders were placed. Return the date and which day of the week it was. **Your Task:** This requires generating a date series. Think about how to create dates that don't exist in your data.
✅ Solution ```sql WITH date_spine AS ( SELECT generate_series('2025-12-01'::date, '2025-12-31'::date, '1 day')::date AS dt ), order_dates AS ( SELECT DISTINCT order_date FROM orders WHERE order_date BETWEEN '2025-12-01' AND '2025-12-31' ) SELECT ds.dt AS missing_date, TO_CHAR(ds.dt, 'Day') AS day_of_week FROM date_spine ds LEFT JOIN order_dates od ON ds.dt = od.order_date WHERE od.order_date IS NULL ORDER BY ds.dt; ``` **Key Concepts:** generate_series (Postgres), date spine pattern, LEFT JOIN to find gaps.
Challenge 3: Self-Join for User Referrals **Problem:** Table `users` has columns (user_id, name, referred_by_user_id). Find all users who referred at least 2 other users, along with the names of who they referred.
✅ Solution ```sql WITH referral_counts AS ( SELECT referred_by_user_id AS referrer_id, COUNT(*) AS referral_count FROM users WHERE referred_by_user_id IS NOT NULL GROUP BY referred_by_user_id HAVING COUNT(*) >= 2 ) SELECT referrer.name AS referrer_name, STRING_AGG(referred.name, ', ' ORDER BY referred.name) AS referred_users, rc.referral_count FROM referral_counts rc JOIN users referrer ON rc.referrer_id = referrer.user_id JOIN users referred ON referred.referred_by_user_id = rc.referrer_id GROUP BY referrer.name, rc.referral_count; ``` **Key Concepts:** Self-join, HAVING for post-aggregation filter, STRING_AGG for concatenation.

💬 Week 1 Discussion Prompts

Post your answers in the Week 1 Discussion Thread:

  1. “What’s the most confusing JOIN scenario you’ve encountered?” — Share and help others understand.
  2. “Subquery vs CTE: When do you choose which?” — There’s no single right answer; discuss trade-offs.
  3. “Share your Day 6-7 mini-project query” — Get peer feedback on readability and correctness.

⚠️ Common Week 1 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Using WHERE to filter on aggregates | WHERE filters rows before grouping | Use HAVING for aggregate conditions (see the sketch below) |
| Forgetting NULL handling in JOINs | LEFT JOIN creates NULLs; comparisons fail | Use COALESCE() or IS NULL checks |
| Not aliasing tables in self-joins | Query becomes unreadable, may error | Always alias: users u1 JOIN users u2 |
| Using SELECT * in production | Performance issues; schema changes break code | Explicitly list needed columns |
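
To make the WHERE-vs-HAVING row concrete, here is a small sketch run through DuckDB from Python, assuming the sample orders.csv with customer_id and amount columns: the first query filters individual rows before grouping, the second filters whole groups after aggregation.

```python
import duckdb

con = duckdb.connect()

# WHERE filters rows BEFORE grouping: only orders over $50 are counted at all
large_orders_only = con.execute("""
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spend
    FROM 'data/orders.csv'
    WHERE amount > 50
    GROUP BY customer_id
""").fetchdf()

# HAVING filters groups AFTER aggregation: every order counts, low-spend customers are dropped
high_spend_customers = con.execute("""
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spend
    FROM 'data/orders.csv'
    GROUP BY customer_id
    HAVING SUM(amount) > 50
""").fetchdf()

print(large_orders_only)
print(high_spend_customers)
```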

✅ Week 1 Self-Assessment

Before moving to Week 2, can you:

  • Write a 3-table JOIN without looking up syntax?
  • Explain the difference between WHERE and HAVING in one sentence?
  • Convert any subquery to a CTE (and know when not to)?
  • Use COALESCE to handle NULLs in your output?

Milestone: 15 SQL problems solved; 1 mini-project query committed to GitHub.


Week 2: SQL Mastery – Window Functions & CTEs

Goal: Become fluent in ROW_NUMBER, RANK, LAG/LEAD, running totals, and cohort queries.

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 8 | ROW_NUMBER, RANK, DENSE_RANK | Explain the differences; solve 2 “top-N per group” problems | Mode Window Funcs |
| 9 | LAG / LEAD | Compute day-over-day change; solve 2 problems | StrataScratch |
| 10 | Running Totals / Averages | Use SUM() OVER with frames; solve 2 problems | DataLemur |
| 11 | NTILE, PERCENT_RANK | Bucket users into deciles; explain a use case | Mode Tutorial |
| 12 | Cohort & Retention Query | Build a retention matrix query for a sample events table | Handbook: Cohort Analysis |
| 13–14 | Catch-up / Mini-project | Publish a cohort retention query + heatmap visual (Python or Sheets) | Your GitHub |

🆕 Visual SQL (See Your Data Immediately)

Don’t just write queries — visualize the results to build intuition. Window functions are much easier to understand when you can see the data change.

Option 1: DuckDB + Python (Recommended)

import duckdb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data and run window function query
con = duckdb.connect()
df = con.execute("""
    SELECT 
        dt,
        dau,
        AVG(dau) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7d_avg
    FROM 'data/daily_metrics.csv'
    ORDER BY dt
""").fetchdf()

# Visualize immediately
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(df['dt'], df['dau'], label='Daily DAU', alpha=0.7)
ax.plot(df['dt'], df['trailing_7d_avg'], label='7-Day Trailing Avg', linewidth=2)
ax.legend()
ax.set_title('Daily Active Users with 7-Day Moving Average')
plt.show()

Option 2: Metabase (Free, Local)

  • Run docker run -d -p 3000:3000 metabase/metabase
  • Connect to your local database
  • Build charts directly from SQL queries

🤖 AI-Augmented Challenge (Week 2)

SQL Code Review with AI:

  1. Write a complex window function query (e.g., cohort retention)
  2. Ask ChatGPT: “Review this SQL query for correctness, performance, and readability. Point out any edge cases I might have missed.”
  3. Critical step: Test the AI’s suggestions! Does the query still produce correct results?
  4. Reflection: What did the AI catch that you missed? What did it get wrong?

🧠 Week 2 Challenge Problems

Challenge 1: Top 3 Products per Category (Classic Interview Question) **Problem:** Given a `products` table with (product_id, category, product_name, revenue), find the top 3 products by revenue in each category. Handle ties by including all tied products. ```sql -- Sample Data CREATE TABLE products ( product_id INT, category TEXT, product_name TEXT, revenue DECIMAL ); INSERT INTO products VALUES (1, 'Electronics', 'Laptop', 1200), (2, 'Electronics', 'Phone', 800), (3, 'Electronics', 'Tablet', 800), -- Tie with Phone (4, 'Electronics', 'Headphones', 150), (5, 'Electronics', 'Watch', 300), (6, 'Clothing', 'Jacket', 200), (7, 'Clothing', 'Shoes', 180); ```
✅ Solution ```sql WITH ranked AS ( SELECT *, DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk FROM products ) SELECT category, product_name, revenue, rnk FROM ranked WHERE rnk <= 3 ORDER BY category, rnk; ``` **Why DENSE_RANK?** If two products tie for 2nd place, the next product is still 3rd (not 4th). Use `RANK()` if you want to skip numbers, `ROW_NUMBER()` if you want exactly 3 rows regardless of ties. | Function | Tie Handling | Use Case | |----------|--------------|----------| | ROW_NUMBER | No ties (arbitrary) | Exactly N rows needed | | RANK | Skips numbers after tie | Competition rankings | | DENSE_RANK | No gaps after tie | "Top N including ties" |
Challenge 2: Month-over-Month Growth Rate **Problem:** Calculate the month-over-month revenue growth percentage. Handle the first month (no prior month) gracefully. ```sql CREATE TABLE monthly_revenue (month DATE, revenue DECIMAL); INSERT INTO monthly_revenue VALUES ('2025-01-01', 10000), ('2025-02-01', 12000), ('2025-03-01', 11500), ('2025-04-01', 15000); ```
✅ Solution ```sql SELECT month, revenue, LAG(revenue) OVER (ORDER BY month) AS prev_revenue, ROUND( 100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 2 ) AS mom_growth_pct FROM monthly_revenue ORDER BY month; ``` **Key Concepts:** - `LAG(revenue) OVER (ORDER BY month)` gets the previous row's value - `NULLIF(..., 0)` prevents division by zero - First row will show `NULL` for growth (no prior month)
Challenge 3: Running 7-Day Average (Real-World Analytics) **Problem:** Calculate a 7-day trailing average of daily active users (DAU). This is extremely common in product analytics dashboards. ```sql CREATE TABLE daily_metrics (dt DATE, dau INT); INSERT INTO daily_metrics VALUES ('2025-12-01', 1000), ('2025-12-02', 1050), ('2025-12-03', 980), ('2025-12-04', 1100), ('2025-12-05', 1200), ('2025-12-06', 900), ('2025-12-07', 1150), ('2025-12-08', 1300), ('2025-12-09', 1250); ```
✅ Solution ```sql SELECT dt, dau, ROUND( AVG(dau) OVER ( ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ), 2 ) AS trailing_7d_avg, COUNT(*) OVER ( ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS days_in_window FROM daily_metrics ORDER BY dt; ``` **Key Insight:** The `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` creates a 7-day window (current + 6 prior). The `days_in_window` column shows how many days are actually in the calculation—useful for the first 6 days when the window is partial. **Alternative with RANGE (be careful!):** ```sql -- RANGE looks at VALUES, not row positions -- Use ROWS for day-based windows to handle missing dates correctly ```
Challenge 4: Cohort Retention Matrix (Advanced) **Problem:** Build a retention matrix showing what percentage of users from each signup cohort (month) were active in subsequent months. ```sql CREATE TABLE user_activity ( user_id INT, signup_date DATE, activity_date DATE ); -- Assume data for users who signed up and logged in on various dates ```
✅ Solution ```sql WITH cohorts AS ( SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month FROM user_activity GROUP BY user_id, DATE_TRUNC('month', signup_date) ), activity AS ( SELECT user_id, DATE_TRUNC('month', activity_date) AS activity_month FROM user_activity GROUP BY user_id, DATE_TRUNC('month', activity_date) ), cohort_activity AS ( SELECT c.cohort_month, a.activity_month, COUNT(DISTINCT c.user_id) AS active_users, -- Months since signup (0 = signup month) EXTRACT(YEAR FROM AGE(a.activity_month, c.cohort_month)) * 12 + EXTRACT(MONTH FROM AGE(a.activity_month, c.cohort_month)) AS month_number FROM cohorts c JOIN activity a ON c.user_id = a.user_id AND a.activity_month >= c.cohort_month GROUP BY c.cohort_month, a.activity_month ), cohort_sizes AS ( SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size FROM cohorts GROUP BY cohort_month ) SELECT ca.cohort_month, ca.month_number, ca.active_users, cs.cohort_size, ROUND(100.0 * ca.active_users / cs.cohort_size, 1) AS retention_pct FROM cohort_activity ca JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month ORDER BY ca.cohort_month, ca.month_number; ``` **Visualization:** Take this output and pivot it in Python/Excel to create a retention heatmap where rows = cohort months, columns = month_number (0, 1, 2...), cells = retention_pct.

💬 Week 2 Discussion Prompts

  1. “ROWS vs RANGE in window frames—when does it matter?” — Share an example where they produce different results.
  2. “What’s the trickiest window function problem you’ve solved?” — Post the problem and your approach.
  3. “Share your cohort retention visualization” — Get feedback on clarity and insights.

⚠️ Common Week 2 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Forgetting PARTITION BY | The window function runs over the entire table | Add PARTITION BY for per-group calculations |
| Using RANK when you need exactly N | RANK can return more than N rows on ties | Use ROW_NUMBER for a strict limit |
| Relying on the default window frame | With ORDER BY, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (a running window, not a moving one); without ORDER BY it spans the whole partition | Explicitly set ROWS BETWEEN (see the sketch below) |
| Calculating retention without distinct users | One user with 10 logins ≠ 10 retained users | Always COUNT(DISTINCT user_id) |
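
The default-frame row is easiest to see by running both versions side by side. A minimal sketch, assuming the sample daily_metrics.csv with dt and dau columns as in the Visual SQL example above:

```python
import duckdb

con = duckdb.connect()
df = con.execute("""
    SELECT
        dt,
        dau,
        -- Implicit frame: with ORDER BY, this is a running (cumulative) average
        AVG(dau) OVER (ORDER BY dt) AS cumulative_avg,
        -- Explicit frame: a true trailing 7-day window
        AVG(dau) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7d_avg
    FROM 'data/daily_metrics.csv'
    ORDER BY dt
""").fetchdf()
print(df.head(10))
```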

✅ Week 2 Self-Assessment

Before moving to Week 3, can you:

  • Explain the difference between RANK, DENSE_RANK, and ROW_NUMBER with an example?
  • Calculate a trailing 7-day average using the correct window frame?
  • Build a cohort retention query from scratch?
  • Use LAG/LEAD to compute period-over-period changes?

Milestone: 10 advanced SQL problems; retention query + visualization.


Week 3: Statistics for Product Analytics

Goal: Build intuition for the stats used every day in experimentation.

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 15 | Distributions: Normal, Binomial | Watch StatQuest; sketch 3 product examples | StatQuest: Normal |
| 16 | Central Limit Theorem | Simulate sample means in Python and see the CLT in action (see the sketch below) | Handbook: Statistics |
| 17 | Hypothesis Testing (t-test, z-test) | Calculate by hand; then verify with scipy | Handbook: Hypothesis Testing |
| 18 | p-values & Confidence Intervals | Write a 1-paragraph explanation a PM would understand | Own notes |
| 19 | Power & Sample Size | Use the handbook calculator or write your own formula | Sample Size Calculator |
| 20–21 | Catch-up / Mini-project | Create a Python script or notebook that computes sample size given α, β, baseline, MDE | Your GitHub |
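
For Day 16, a minimal CLT sketch: draw repeated samples from a deliberately skewed population and watch the distribution of sample means come out approximately normal (the sample sizes and seed are arbitrary choices).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A heavily skewed population (exponential), nothing like a normal distribution
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(population, bins=50)
ax1.set_title('Skewed population')
ax2.hist(sample_means, bins=50)
ax2.set_title('Sample means (n=50): approximately normal')
plt.tight_layout()
plt.show()
```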

🆕 The Peeking Simulation (Why Early Stopping is Dangerous)

This simulation shows how “peeking” at A/B test results inflates the false positive rate. Run it yourself to build intuition.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def simulate_peeking(n_simulations=1000, true_effect=0, 
                     sample_size_per_peek=100, n_peeks=10, alpha=0.05):
    """
    Simulate A/B tests with peeking vs waiting for full sample.
    true_effect=0 means there's NO real effect (null hypothesis is true).
    """
    false_positives_peeking = 0
    false_positives_proper = 0
    
    for _ in range(n_simulations):
        # Generate full dataset for both groups
        control = np.random.normal(0, 1, sample_size_per_peek * n_peeks)
        treatment = np.random.normal(true_effect, 1, sample_size_per_peek * n_peeks)
        
        # Peeking: check at each peek point, stop if significant
        peeking_significant = False
        for peek in range(1, n_peeks + 1):
            n = sample_size_per_peek * peek
            _, p_value = stats.ttest_ind(control[:n], treatment[:n])
            if p_value < alpha:
                peeking_significant = True
                break  # Stop at first significant result
        
        if peeking_significant:
            false_positives_peeking += 1
        
        # Proper: wait for full sample
        _, p_value_full = stats.ttest_ind(control, treatment)
        if p_value_full < alpha:
            false_positives_proper += 1
    
    return {
        'peeking_fp_rate': false_positives_peeking / n_simulations,
        'proper_fp_rate': false_positives_proper / n_simulations
    }

# Run simulation
results = simulate_peeking(n_simulations=1000, true_effect=0, n_peeks=10)
print(f"False Positive Rate (Peeking): {results['peeking_fp_rate']:.1%}")
print(f"False Positive Rate (Proper):  {results['proper_fp_rate']:.1%}")
print(f"Expected (α=0.05):             5.0%")

Expected Output:

False Positive Rate (Peeking): ~20-30%  ← Much higher than expected!
False Positive Rate (Proper):  ~5%      ← As expected
Expected (α=0.05):             5.0%

The Insight: With 10 peeks, your false positive rate can inflate to 20-30% even though you’re using α=0.05. This is why pre-registration and sequential testing methods matter.

🤖 AI-Augmented Challenge (Week 3)

Synthetic Data Generation:

  1. Use Python’s faker library or an LLM to generate a dataset with specific properties:
    • 10,000 users
    • Conversion rate of 3.5% with a specific distribution
    • Include confounding variables (e.g., users from paid ads convert higher)
  2. Prompt for LLM: “Generate Python code to create a synthetic A/B test dataset with 10,000 users, a control group with 3% conversion and treatment with 3.5% conversion, including user attributes like signup source and device type that correlate with conversion.”
  3. Run your hypothesis testing code on the synthetic data
  4. Reflection: Did you detect the 0.5 percentage point difference? What sample size did you need?
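
For step 1, a minimal numpy/pandas sketch (no LLM or faker required); the column names, the paid-ads base rates, and the +0.5pp true effect are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users = 10_000

# Confounder: users acquired via paid ads convert at a higher base rate
source = rng.choice(['organic', 'paid'], size=n_users, p=[0.7, 0.3])
group = rng.choice(['control', 'treatment'], size=n_users)

base_rate = np.where(source == 'paid', 0.05, 0.025)   # confounder effect
lift = np.where(group == 'treatment', 0.005, 0.0)     # true +0.5pp treatment effect
converted = rng.random(n_users) < (base_rate + lift)

df = pd.DataFrame({'group': group, 'source': source, 'converted': converted})
print(df.groupby('group')['converted'].mean())
```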

🧠 Week 3 Challenge Problems

Challenge 1: Explain This to a PM (Communication Exercise) **Scenario:** Your PM asks: "We ran an A/B test and got a p-value of 0.03. The data scientist says it's significant but we should wait for more data. That doesn't make sense—can you explain?" **Your Task:** Write a 3-4 sentence explanation that: 1. Explains what p=0.03 actually means 2. Explains why waiting might still be the right call 3. Uses no jargon a PM wouldn't understand
✅ Sample Answer "A p-value of 0.03 means there's about a 3% chance we'd see results this extreme if the feature actually had no effect. That sounds convincing, but here's the catch: if we check our experiment results multiple times before it's 'done,' we inflate our chances of seeing a false positive. It's like flipping a coin 10 times—if you stop the moment you get 3 heads in a row and declare the coin biased, you'll often be wrong. The data scientist likely wants to wait until we hit our pre-planned sample size to avoid this 'peeking' problem. Think of it like not opening the oven door while the cake is baking—checking early can ruin the result." **Key Concepts:** Peeking/multiple testing problem, pre-registration of sample size.
Challenge 2: Sample Size Calculation (By Hand) **Problem:** You're planning an A/B test for a checkout flow change. Current conversion rate is 3.2%. You want to detect a 10% relative lift (to 3.52%) with 80% power at α=0.05. **Calculate the required sample size per variant.**
✅ Solution **Formula for proportions (two-tailed):** $$n = \frac{2 \cdot (Z_{\alpha/2} + Z_{\beta})^2 \cdot \bar{p}(1-\bar{p})}{(\Delta)^2}$$ Where: - $Z_{\alpha/2} = 1.96$ (for α=0.05, two-tailed) - $Z_{\beta} = 0.84$ (for 80% power) - $p_1 = 0.032$, $p_2 = 0.0352$ - $\bar{p} = (p_1 + p_2)/2 = 0.0336$ - $\Delta = p_2 - p_1 = 0.0032$ **Calculation:** $$n = \frac{2 \cdot (1.96 + 0.84)^2 \cdot 0.0336 \cdot 0.9664}{(0.0032)^2}$$ $$n = \frac{2 \cdot 7.84 \cdot 0.0325}{0.00001024}$$ $$n = \frac{0.5096}{0.00001024} \approx 49,766$$ **Answer:** ~50,000 users per variant (100,000 total). **Python verification:** ```python from statsmodels.stats.power import NormalIndPower from statsmodels.stats.proportion import proportion_effectsize effect_size = proportion_effectsize(0.032, 0.0352) analysis = NormalIndPower() n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided') print(f"Sample size per group: {n:.0f}") # ~49,800 ```
Challenge 3: Interpreting Confidence Intervals **Problem:** An experiment shows: - Control conversion: 4.2% (95% CI: 3.9% – 4.5%) - Treatment conversion: 4.6% (95% CI: 4.3% – 4.9%) The PM says "Treatment is clearly better—the CIs don't overlap!" **Questions:** 1. Is the PM's logic correct? 2. What's the actual test for significance here? 3. What additional information would you want?
✅ Solution **1. Is the PM's logic correct?** Partially. Non-overlapping CIs *do* indicate significance, but overlapping CIs don't necessarily mean non-significance! The rule of thumb is unreliable. You need to look at the CI of the *difference*. **2. The actual test:** Calculate the confidence interval for the difference: - Point estimate: 4.6% - 4.2% = 0.4 percentage points - SE of difference: √(SE₁² + SE₂²) - If the CI of the difference excludes 0, it's significant **3. Additional information needed:** - Sample sizes for each group - The CI for the *lift* or *difference*, not individual proportions - Whether this was a one-tailed or two-tailed test - Was the test pre-registered with a specific MDE? **Key Insight:** Always report the CI for the treatment effect, not just individual group CIs.
Challenge 4: Power Analysis Intuition **Problem:** Match each scenario to whether it INCREASES or DECREASES required sample size: | Scenario | Effect on Sample Size | |----------|----------------------| | A. Increasing desired power from 80% to 90% | ? | | B. Decreasing significance level from 0.05 to 0.01 | ? | | C. Baseline conversion rate moves from 5% to 50% | ? | | D. MDE changes from 5% relative to 10% relative | ? | | E. Switching from two-tailed to one-tailed test | ? |
✅ Solutions | Scenario | Effect | Why | |----------|--------|-----| | A. 80% → 90% power | **INCREASES** | Need more data to reduce false negatives | | B. α from 0.05 → 0.01 | **INCREASES** | Stricter threshold requires more evidence | | C. Baseline 5% → 50% | **INCREASES** | Higher variance at p=0.5 (p(1-p) maximized) | | D. MDE 5% → 10% | **DECREASES** | Larger effects are easier to detect | | E. Two-tailed → one-tailed | **DECREASES** | All alpha is on one side; easier to detect | **Memory trick:** Anything that makes detection "harder" (higher power, lower alpha, smaller effect, more variance) increases sample size.

💬 Week 3 Discussion Prompts

  1. “Share your PM-friendly explanation of p-values” — Let others critique for clarity.
  2. “What’s the biggest statistical mistake you’ve seen in A/B testing?” — Learn from real-world errors.
  3. “Post your sample size calculator code” — Compare implementations.

⚠️ Common Week 3 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| “p=0.05 means 5% chance the result is due to chance” | The p-value is P(data\|H₀), not P(H₀\|data) | p-value = probability of seeing data this extreme if the null is true |
| Using 80% power as a universal default | May need 90%+ for high-stakes decisions | Match power to the business cost of false negatives |
| Ignoring multiple testing | 20 metrics at α=0.05 → expect 1 false positive | Apply Bonferroni or FDR correction (see the sketch below) |
| Confusing statistical vs practical significance | p=0.001 doesn’t mean the effect matters | Always report effect size with a confidence interval |
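
To see the multiple-testing row in action, a small simulation sketch: test 20 metrics where the null is true for all of them, then apply Bonferroni and Benjamini-Hochberg corrections with statsmodels.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# 20 metrics, all with NO real difference between groups
p_values = []
for _ in range(20):
    control = rng.normal(0, 1, 1000)
    treatment = rng.normal(0, 1, 1000)
    p_values.append(stats.ttest_ind(control, treatment).pvalue)

raw_hits = sum(p < 0.05 for p in p_values)
bonf_reject, _, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
fdr_reject, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

print(f"Unadjusted 'significant' metrics: {raw_hits}")
print(f"After Bonferroni: {bonf_reject.sum()}, after FDR (BH): {fdr_reject.sum()}")
```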

✅ Week 3 Self-Assessment

Before moving to Week 4, can you:

  • Explain p-value, power, and CI to a non-technical stakeholder?
  • Calculate sample size by hand for a proportion-based test?
  • Identify when confidence intervals are misleading?
  • Explain why “peeking” at experiment results inflates false positives?

Milestone: Explain CLT, p-value, power in plain English; sample size script.


Week 4: A/B Testing & Experimentation

Goal: Design, size, and analyze an experiment end-to-end.

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 22 | Experiment Design | Define hypothesis, unit, success & guardrail metrics for a feature idea | Airbnb Blog |
| 23 | Randomization & Assignment | Explain salt-based hashing (see the sketch below); list common pitfalls | Netflix Tech Blog |
| 24 | Analyze a Mock Result | Given sample data, compute lift, CI, p-value; state a conclusion | Handbook: A/B Test Project |
| 25 | Interpreting Edge Cases | What if the success metric is up but a guardrail is down? Write a decision framework | Own notes |
| 26 | Common Pitfalls (peeking, SRM) | List 5 mistakes and how to avoid them | Trustworthy Online Experiments (book) |
| 27–28 | Catch-up / Mini-project | Write up a full experiment doc: hypothesis → design → analysis → decision | Your GitHub |
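
For Day 23, a minimal sketch of salt-based assignment; the salt string, hash choice, and 50/50 split are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, n_variants: int = 2) -> int:
    """Deterministically map a user to a variant bucket for one experiment."""
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# Same user + same salt always lands in the same bucket;
# a new salt re-randomizes assignment for the next experiment.
for uid in ['user_1', 'user_2', 'user_3']:
    print(uid, assign_variant(uid, experiment_salt='super_like_v1'))
```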

🆕 Multi-Armed Bandits (Modern Alternative to A/B Testing)

Traditional A/B tests run to a fixed sample size. Multi-Armed Bandits (MAB) dynamically allocate traffic to better-performing variants, reducing regret (the cost of showing worse variants).

When to Use MAB vs A/B:

| Scenario | Best Approach | Why |
|----------|---------------|-----|
| High stakes, need statistical rigor | A/B Test | Fixed sample, clean interpretation |
| Low stakes, want to minimize regret | MAB | Adaptive allocation, faster “winner” |
| Short-lived (< 2 weeks) | MAB | Less time to waste on losers |
| Long-term, need precise effect size | A/B Test | MAB doesn’t give clean CIs |
| Many variants (5+) | MAB | Efficiently prunes losers |

Simple Thompson Sampling Example:

import numpy as np
from scipy import stats

class ThompsonSampling:
    """Simple Thompson Sampling for Bernoulli bandits (conversions)."""
    
    def __init__(self, n_variants):
        # Beta(1,1) = uniform prior
        self.successes = np.ones(n_variants)  # alpha
        self.failures = np.ones(n_variants)   # beta
    
    def choose_variant(self):
        """Sample from posterior and pick the best."""
        samples = [np.random.beta(s, f) for s, f in zip(self.successes, self.failures)]
        return np.argmax(samples)
    
    def update(self, variant, converted):
        """Update posterior with new observation."""
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1
    
    def get_conversion_rates(self):
        """Get estimated conversion rates (posterior mean)."""
        return self.successes / (self.successes + self.failures)

# Simulate a campaign with 3 variants
bandit = ThompsonSampling(n_variants=3)
true_rates = [0.03, 0.035, 0.04]  # Variant 2 is best

for _ in range(10000):
    chosen = bandit.choose_variant()
    converted = np.random.random() < true_rates[chosen]
    bandit.update(chosen, converted)

print("Estimated rates:", bandit.get_conversion_rates())
print("True rates:     ", true_rates)

Key Insight: Over time, Thompson Sampling naturally allocates more traffic to the best variant while still exploring others.

🤖 AI-Augmented Challenge (Week 4)

Experiment Design Review with AI:

  1. Write a complete experiment design document (hypothesis, metrics, sample size, etc.)
  2. Submit it to ChatGPT with this prompt: “You are a skeptical Staff Data Scientist reviewing this experiment design. Ask me 5 challenging questions about potential flaws, edge cases, or metrics I might have missed.”
  3. Answer each question thoughtfully
  4. Reflection: Did the AI raise concerns you hadn’t considered? Were any of its concerns misguided?

🧠 Week 4 Challenge Problems

Challenge 1: Design a Complete Experiment (Meta-Style Interview Question) **Scenario:** Instagram is considering adding a "Super Like" feature where users can super-like one post per day. This would notify the poster with a special animation. **Your Task:** Write a complete experiment design document covering: 1. **Hypothesis** (one sentence) 2. **Randomization unit** (user, device, session?) and why 3. **Success metrics** (1-2 primary, 2-3 secondary) 4. **Guardrail metrics** (what could go wrong?) 5. **Sample size estimate** (use reasonable assumptions) 6. **Duration** and what could extend it
✅ Sample Design Document **1. Hypothesis:** Adding a Super Like feature will increase meaningful engagement (measured by comments and DMs) by 3% without negatively impacting daily active users or time spent. **2. Randomization Unit:** **User-level** randomization, not session or device: - Super Like is a persistent feature (once you use it, you can't un-use it) - Need consistent experience across devices/sessions - Avoids contamination from shared devices **3. Success Metrics:** | Metric | Type | Rationale | |--------|------|-----------| | Comments per DAU | Primary | Super Likes should spark conversations | | DMs sent per DAU | Primary | Notifications may drive private messages | | Super Likes sent per DAU | Secondary | Feature adoption rate | | Super Likes received per DAU | Secondary | Distribution of engagement | | Posts created per DAU | Secondary | Does receiving Super Likes encourage posting? | **4. Guardrail Metrics:** | Guardrail | Concern | |-----------|---------| | DAU | Feature feels spammy → users leave | | Time spent per session | Users spend time on Super Like UI → less feed time | | Regular likes per DAU | Super Like cannibalizes regular engagement | | Report rate | Super Like used for harassment | | Unfollow rate | Super Like notifications are annoying | **5. Sample Size Estimate:** - Baseline comments/DAU: 0.8 - MDE: 3% relative → 0.024 absolute increase - Variance: Assume Poisson, σ² ≈ μ → σ = 0.89 - With α=0.05, power=80%: ~50,000 users per variant **6. Duration:** - Minimum 2 weeks (full weekly cycle × 2) - Extend if: - Novelty effect suspected (engagement decreasing over time in treatment) - Holiday or major event during test period - Sample ratio mismatch detected
Challenge 2: Analyze This Experiment Result **You ran an experiment with these results:** | Metric | Control (n=50,000) | Treatment (n=50,000) | Lift | 95% CI | p-value | |--------|-------------------|---------------------|------|--------|---------| | Conversion Rate | 3.20% | 3.45% | +7.8% | [+2.1%, +13.5%] | 0.008 | | Revenue per User | $2.10 | $2.08 | -0.9% | [-4.2%, +2.4%] | 0.62 | | Page Load Time | 1.2s | 1.8s | +50% | [+48%, +52%] | <0.001 | **Questions:** 1. Should you ship this feature? Why or why not? 2. What follow-up analysis would you do? 3. How would you present this to leadership?
✅ Analysis Framework **1. Should you ship?** **No, not without further investigation.** Here's the breakdown: | Metric | Verdict | Reasoning | |--------|---------|-----------| | Conversion ✅ | Good | +7.8% is statistically significant and meaningful | | Revenue ⚠️ | Neutral | No significant change; concerning given conversion increase | | Page Load ❌ | Bad | 50% slower is a major UX regression and SEO risk | **The revenue puzzle:** If conversion is up 7.8% but revenue is flat, that means: - Revenue per converter went *down* - Lower-quality conversions? Smaller order values? - Need to segment by customer type **The page load issue:** This is likely the root cause of the revenue problem: - Slower pages = frustrated users = lower conversion quality - Google penalizes slow pages in search rankings - Must fix before shipping **2. Follow-up Analysis:** - Segment conversion by new vs returning users - Check revenue per *converted* user (not all users) - Investigate why treatment is slower (A/B test infrastructure issue? Feature itself?) - Look at bounce rate and time-on-page metrics **3. Presenting to Leadership:** *"The test showed promising conversion gains (+7.8%), but we identified a technical issue: the treatment variant loads 50% slower. This likely explains why revenue didn't follow conversion. I recommend we fix the performance issue and re-run, as the underlying feature hypothesis still looks promising."* **Never present metrics in isolation. Tell a story.**
Challenge 3: Sample Ratio Mismatch (SRM) Detection **Problem:** Your A/B test was configured for 50/50 split. After 1 week, you see: - Control: 102,340 users - Treatment: 97,660 users **Questions:** 1. Is this an SRM? (Use chi-squared test) 2. What are possible causes? 3. What should you do?
✅ Solution **1. Chi-Squared Test for SRM:** ```python from scipy.stats import chisquare observed = [102340, 97660] expected = [100000, 100000] # 50/50 of total 200,000 chi2, p_value = chisquare(observed, expected) print(f"Chi-squared: {chi2:.2f}, p-value: {p_value:.6f}") # Chi-squared: 109.5, p-value: 0.000000 ``` **Result:** p < 0.001 → **This IS a significant SRM.** The split is not random. **2. Possible Causes:** | Cause | How It Happens | |-------|----------------| | Randomization bug | Hash function not evenly distributed | | Bot filtering | Bots disproportionately in one variant | | Browser/device exclusion | Treatment feature doesn't work on Safari → those users excluded | | Redirect-based assignment | Slower treatment page → more users bounce before logging | | Experiment start time | Gradual rollout started at different times | **3. What to Do:** 1. **STOP the experiment** — results are untrustworthy 2. Investigate the assignment mechanism 3. Check if Treatment has any technical issues (crashes, redirects) 4. Look at pre-experiment data to see if groups were already different 5. Fix the issue and restart with fresh users **Rule of Thumb:** If SRM is detected, no amount of "adjusting for it" makes the results valid. The experiment is compromised.
Challenge 4: The Guardrail Paradox **Scenario:** Your experiment shows: - Primary metric (purchases): +5% (p=0.02) ✅ - Guardrail (customer service tickets): +15% (p=0.01) ❌ Leadership wants to ship because revenue is more important than support costs. **What's your recommendation and how do you communicate it?**
✅ Framework **Quantify the trade-off:** 1. Calculate incremental revenue from +5% purchases 2. Calculate incremental cost from +15% support tickets 3. Don't forget: unhappy customers churn, leave bad reviews, reduce LTV **Example Analysis:** ``` Incremental purchases: 5% × 1M purchases × $50 avg = $2.5M Incremental tickets: 15% × 50K tickets × $20 cost each = $150K Net gain: $2.35M ``` BUT also consider: - Why are tickets up? Product confusing? Quality issue? - Will ticket rate stay at +15% or grow as frustrated users accumulate? - What's the customer sentiment in those tickets? **Communication Framework:** *"I recommend we investigate before shipping. The +15% in support tickets isn't just a cost issue—it's a signal that something in the feature is confusing or broken. If we ship without understanding why, we risk:* 1. *Churn from frustrated customers* 2. *Negative reviews affecting acquisition* 3. *Scaling support costs as we grow* *Can we run a qualitative analysis on the ticket text to identify the root cause? If it's a fixable UX issue, we could ship after a quick iteration and re-test."* **Key Insight:** Guardrails exist for a reason. Don't ignore them just because the primary metric looks good.

💬 Week 4 Discussion Prompts

  1. “Share your experiment design doc” — Get peer feedback on metric choices and potential blind spots.
  2. “What’s the hardest experiment decision you’ve had to make?” — Trade-off stories teach the most.
  3. “Post a real or hypothetical guardrail dilemma” — Practice the decision framework.

⚠️ Common Week 4 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Calling the experiment early when p < 0.05 | Peeking inflates false positives | Pre-commit to sample size; use sequential testing if needed |
| Ignoring SRM | Results are meaningless with biased assignment | Always check SRM before analyzing results |
| Too many success metrics | Increases false positive rate | Pick 1-2 primary metrics; adjust for multiple testing |
| Randomizing at session level for persistent features | User sees different variants → confusion and bias | Match randomization unit to feature persistence |

✅ Week 4 Self-Assessment

Before moving to Week 5, can you:

  • Write a complete experiment design document from scratch?
  • Calculate and interpret lift, confidence interval, and p-value?
  • Detect Sample Ratio Mismatch using chi-squared test?
  • Navigate the trade-off when success metric is up but guardrail is down?

🎯 HALFWAY POINT! You’ve earned your Halfway Hero badge!

Milestone: Complete experiment write-up (design + analysis + recommendation).


Week 5: Python (Pandas & Polars) for Analysis

Goal: Confidently clean, explore, and visualize a messy dataset. New for 2026: Learn Polars as a high-performance alternative.

Open In Colab

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 29 | DataFrame Basics | Load a CSV, inspect with info(), head(), describe() | Handbook: Python Analysis |
| 30 | Data Cleaning | Handle nulls, duplicates, dtypes; document your decisions | Kaggle: Data Cleaning |
| 31 | Groupby & Aggregation | Compute metrics by segment; compare to the SQL approach | Pandas Docs |
| 32 | Merging & Joining | Merge 2 tables; assert key uniqueness before the join | Handbook: Python Patterns |
| 33 | Visualization (Matplotlib/Seaborn) | Create bar, line, histogram, heatmap for your dataset | Seaborn Gallery |
| 34–35 | Catch-up / Mini-project | Publish an EDA notebook with insights and 3+ charts | Your GitHub |

🆕 Polars: The High-Performance Alternative (2026 Must-Know)

Polars is gaining massive traction as a faster, more memory-efficient alternative to Pandas. It uses Rust under the hood and supports lazy evaluation — essential for large datasets.

Why Learn Polars in 2026?

  • 10-100x faster than Pandas for many operations
  • Lazy execution — builds an optimized query plan before running
  • Parallel by default — uses all CPU cores automatically
  • Growing industry adoption — increasingly mentioned in job postings

Pandas vs Polars Comparison:

# ----- PANDAS -----
import pandas as pd

df_pd = pd.read_csv('data/orders.csv')
result_pd = (
    df_pd[df_pd['amount'] > 50]
    .groupby('customer_id')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)

# ----- POLARS -----
import polars as pl

df_pl = pl.read_csv('data/orders.csv')
result_pl = (
    df_pl.lazy()  # Enable lazy evaluation
    .filter(pl.col('amount') > 50)
    .group_by('customer_id')
    .agg(pl.col('amount').sum())
    .sort('amount', descending=True)
    .collect()  # Execute the query plan
)

Key Differences:

| Feature | Pandas | Polars |
|---------|--------|--------|
| Execution | Eager (immediate) | Lazy (optimized plan) |
| Parallelism | Single-threaded | Multi-threaded by default |
| Memory | Often copies data | Zero-copy when possible |
| Syntax | .loc, .iloc, chaining | Method chaining, expressions |
| Null handling | NaN (float) | Native null type |

Week 5 Polars Challenge: Re-implement your Day 30-31 Pandas code in Polars. Compare performance on a larger dataset.

🤖 AI-Augmented Challenge (Week 5)

Refactoring with AI:

  1. Write your EDA code in a “quick and dirty” way (it’s okay to have spaghetti code!)
  2. Once it works, ask an AI: “Refactor this Pandas code into clean, modular functions with proper docstrings. Use best practices for data analysis code.”
  3. Critical step: Review and test the refactored code. Does it produce the same results?
  4. Reflection: What improvements did the AI make? What did you disagree with?

🧠 Week 5 Challenge Problems

Challenge 1: The Messy Dataset Gauntlet **Problem:** You receive this CSV with user transactions. Clean it and answer the business questions. ```python import pandas as pd from io import StringIO data = """user_id,transaction_date,amount,category,country 1,2025-01-15,50.00,Electronics,USA 1,2025-01-15,50.00,Electronics,USA 2,01/20/2025,$75.50,Clothing,usa 3,2025-01-22,,-,Canada 4,2025-01-25,100.00,Electronics,UK 5,Feb 1 2025,25.00,Food,USA 6,2025-02-10,999999.99,Electronics,USA 1,2025-02-15,30.00,Food,United States """ df = pd.read_csv(StringIO(data)) ``` **Issues to fix:** 1. Duplicate row 2. Inconsistent date formats 3. Amount has $ and is string 4. Missing values (amount, category) 5. Country inconsistency (USA vs usa vs United States) 6. Potential outlier ($999,999) **Tasks:** 1. Clean the data 2. What's the total revenue by category (excluding the outlier)? 3. What's the revenue by country (after standardizing)?
✅ Solution ```python import pandas as pd import numpy as np # 1. Remove duplicates df = df.drop_duplicates() # 2. Parse dates with multiple formats def parse_date(date_str): for fmt in ['%Y-%m-%d', '%m/%d/%Y', '%b %d %Y']: try: return pd.to_datetime(date_str, format=fmt) except: continue return pd.NaT df['transaction_date'] = df['transaction_date'].apply(parse_date) # 3. Clean amount: remove $, convert to float df['amount'] = df['amount'].replace('[\$,]', '', regex=True) df['amount'] = pd.to_numeric(df['amount'], errors='coerce') # 4. Handle missing values # - Drop rows with no amount (can't calculate revenue) # - Fill missing category with 'Unknown' df = df.dropna(subset=['amount']) df['category'] = df['category'].replace('-', np.nan).fillna('Unknown') # 5. Standardize country country_map = { 'usa': 'USA', 'united states': 'USA', 'uk': 'UK', 'canada': 'Canada' } df['country'] = df['country'].str.lower().map(country_map).fillna(df['country']) # 6. Handle outlier (IQR method or business rule) # Business rule: transactions over $10,000 are likely errors df_clean = df[df['amount'] <= 10000] # Answer questions print("Revenue by Category:") print(df_clean.groupby('category')['amount'].sum().sort_values(ascending=False)) print("\nRevenue by Country:") print(df_clean.groupby('country')['amount'].sum().sort_values(ascending=False)) ``` **Output:** ``` Revenue by Category: Electronics 150.00 Clothing 75.50 Food 55.00 Unknown 0.00 # Dropped due to NaN amount Revenue by Country: USA 205.50 UK 100.00 ``` **Key Principle:** Document every cleaning decision. Future you (or your reviewer) needs to understand why.
Challenge 2: Groupby Gymnastics (Interview-Level) **Problem:** Given user session data, calculate for each user: 1. Total sessions 2. Average session duration 3. Days since first session 4. Most common device type 5. Conversion rate (sessions with purchase / total sessions) ```python sessions = pd.DataFrame({ 'user_id': [1,1,1,2,2,3], 'session_date': pd.to_datetime(['2025-01-01','2025-01-05','2025-01-10', '2025-01-02','2025-01-08','2025-01-03']), 'duration_seconds': [300, 450, 200, 600, 300, 150], 'device': ['mobile','mobile','desktop','desktop','desktop','mobile'], 'purchased': [False, True, False, True, True, False] }) ```
✅ Solution ```python today = pd.to_datetime('2025-01-15') def most_common(series): """Return most common value, handling ties.""" return series.mode().iloc[0] if len(series.mode()) > 0 else None user_summary = sessions.groupby('user_id').agg( total_sessions=('session_date', 'count'), avg_duration_sec=('duration_seconds', 'mean'), first_session=('session_date', 'min'), most_common_device=('device', most_common), sessions_with_purchase=('purchased', 'sum'), total_sessions_for_cvr=('purchased', 'count') ).reset_index() # Calculate derived fields user_summary['days_since_first'] = (today - user_summary['first_session']).dt.days user_summary['conversion_rate'] = ( user_summary['sessions_with_purchase'] / user_summary['total_sessions_for_cvr'] ) # Clean up user_summary = user_summary.drop(columns=['first_session', 'total_sessions_for_cvr']) print(user_summary.to_string()) ``` **Output:** ``` user_id total_sessions avg_duration_sec most_common_device sessions_with_purchase days_since_first conversion_rate 0 1 3 316.666667 mobile 1 14 0.333333 1 2 2 450.000000 desktop 2 13 1.000000 2 3 1 150.000000 mobile 0 12 0.000000 ``` **Pandas Pro Tips:** - Use `.agg()` with named aggregations for clarity - Create custom aggregation functions for complex logic - Calculate ratios after the groupby, not inside it
Challenge 3: Merge Madness (Fanout Detection) **Problem:** You're merging users and orders. Predict what happens and how to prevent issues. ```python users = pd.DataFrame({ 'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol'] }) orders = pd.DataFrame({ 'order_id': [101, 102, 103, 104, 105], 'user_id': [1, 1, 2, 2, 4], # Note: user 4 doesn't exist in users 'amount': [50, 30, 75, 25, 100] }) ``` **Questions:** 1. What's the row count after `users.merge(orders)`? (default inner join) 2. What's the row count after `users.merge(orders, how='left')`? 3. How do you calculate revenue per user without row fanout affecting your calculation?
✅ Solution ```python # Question 1: Inner join inner = users.merge(orders) print(f"Inner join rows: {len(inner)}") # 4 rows (user 1's 2 orders + user 2's 2 orders) # Question 2: Left join left = users.merge(orders, how='left') print(f"Left join rows: {len(left)}") # 5 rows (includes Carol with NaN order) # Question 3: Calculate revenue per user WITHOUT fanout # WRONG: merge then groupby (works but wastes memory on large data) # RIGHT: aggregate first, then merge # Method 1: Aggregate orders first user_revenue = orders.groupby('user_id')['amount'].sum().reset_index() user_revenue.columns = ['user_id', 'total_revenue'] result = users.merge(user_revenue, how='left') result['total_revenue'] = result['total_revenue'].fillna(0) print(result) # Method 2: Use .map() for simple lookups revenue_lookup = orders.groupby('user_id')['amount'].sum() users['total_revenue'] = users['user_id'].map(revenue_lookup).fillna(0) ``` **Fanout Prevention Rule:** > Always aggregate the "many" side before merging to the "one" side. **Verification Pattern:** ```python # Before merge print(f"Users: {len(users)}, Orders: {len(orders)}") # After merge merged = users.merge(orders) print(f"Merged: {len(merged)}") # Should equal len(orders) for inner, ≥ len(users) for left # Assert no unexpected fanout expected_rows = len(orders) # for inner join on user_id assert len(merged) == expected_rows, f"Fanout detected: {len(merged)} vs expected {expected_rows}" ```
Challenge 4: Visualization Best Practices **Problem:** Critique these common visualization mistakes and fix them. **Mistake 1: Pie chart for 10 categories** ```python # Bad df.groupby('category')['sales'].sum().plot.pie() ``` **Mistake 2: Dual Y-axes with different scales** ```python # Misleading fig, ax1 = plt.subplots() ax1.plot(df['date'], df['revenue']) # Scale: 0-10M ax2 = ax1.twinx() ax2.plot(df['date'], df['users']) # Scale: 0-1000 ``` **Mistake 3: No context for bar chart** ```python # What's good? What's bad? df.plot.bar(x='region', y='conversion_rate') ```
✅ Better Approaches **Fix 1: Use horizontal bar chart for many categories** ```python import matplotlib.pyplot as plt import seaborn as sns # Sort and limit to top categories top_cats = df.groupby('category')['sales'].sum().nlargest(8) top_cats.sort_values().plot.barh(color='steelblue') plt.xlabel('Sales ($)') plt.title('Top 8 Categories by Sales') plt.tight_layout() ``` **Fix 2: Index both series or use separate charts** ```python # Option A: Index to same baseline fig, ax = plt.subplots() df['revenue_indexed'] = df['revenue'] / df['revenue'].iloc[0] * 100 df['users_indexed'] = df['users'] / df['users'].iloc[0] * 100 ax.plot(df['date'], df['revenue_indexed'], label='Revenue (indexed)') ax.plot(df['date'], df['users_indexed'], label='Users (indexed)') ax.set_ylabel('Index (Day 1 = 100)') ax.legend() # Option B: Small multiples (two separate charts) fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True) ax1.plot(df['date'], df['revenue']) ax1.set_ylabel('Revenue ($)') ax2.plot(df['date'], df['users']) ax2.set_ylabel('Users') ``` **Fix 3: Add benchmark/context** ```python fig, ax = plt.subplots(figsize=(10, 6)) # Add benchmark line benchmark = 0.05 # 5% company average bars = ax.bar(df['region'], df['conversion_rate'], color='steelblue') ax.axhline(y=benchmark, color='red', linestyle='--', label=f'Company Avg ({benchmark:.0%})') # Color bars by performance for bar, val in zip(bars, df['conversion_rate']): if val < benchmark: bar.set_color('coral') ax.set_ylabel('Conversion Rate') ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}')) ax.legend() plt.title('Conversion Rate by Region vs. Company Average') ```

💬 Week 5 Discussion Prompts

  1. “Share your messiest data cleaning challenge” — What issues did you find and how did you solve them?
  2. “Post your EDA notebook” — Get feedback on structure, visualization choices, and insights.
  3. “Pandas vs SQL: which do you prefer for [task]?” — Compare approaches for groupby, joins, window functions.

⚠️ Common Week 5 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| `df['col'] = df['col'].fillna(0)` on a filtered slice | Triggers SettingWithCopyWarning; the assignment may not persist | Create the slice with `.copy()` first, or assign via `df.loc[:, 'col'] = ...` |
| Merging without checking key uniqueness | Silent fanout multiplies your rows | `assert df['key'].is_unique` before the merge |
| Using `apply()` for vectorizable operations | 10-100x slower than vectorized code | Use built-in pandas/numpy operations |
| Not resetting the index after groupby | The index becomes the grouped column | Chain `.reset_index()` for a clean DataFrame |
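
To make these fixes concrete, here is a minimal pandas sketch (the `users` and `orders` frames are made up for illustration) showing the pre-merge uniqueness guard, a vectorized alternative to `apply()`, and the `.copy()` plus `.loc` assignment pattern:

```python
import pandas as pd

# Hypothetical tables, just for illustration
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
orders = pd.DataFrame({"user_id": [1, 1, 2, 2], "amount": [10.0, 20.0, 5.0, 15.0]})

# Fix: guard against silent fanout before merging
assert users["user_id"].is_unique, "users has duplicate keys; a merge would fan out"
merged = users.merge(orders, on="user_id", how="left")
assert len(merged) >= len(users)  # a left join keeps every user

# Fix: prefer vectorized operations over apply()
# slow: orders["amount"].apply(lambda x: x * 1.1)
orders["amount_with_tax"] = orders["amount"] * 1.1  # vectorized

# Fix: take an explicit copy before assigning to a filtered slice
big_orders = orders[orders["amount"] > 10].copy()
big_orders.loc[:, "tier"] = "high"  # no SettingWithCopyWarning
```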

✅ Week 5 Self-Assessment

Before moving to Week 6, can you:

  • Load a CSV and immediately identify its data quality issues?
  • Write a groupby with multiple aggregations including custom functions?
  • Merge two tables and verify no unexpected row fanout occurred?
  • Create a publication-quality chart with title, labels, and context?
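
If the groupby item gives you pause, here is a minimal refresher sketch (the column names are made up) that combines built-in and custom aggregations in a single pass:

```python
import pandas as pd

# Hypothetical order-level data
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "amount": [10.0, 20.0, 5.0, 15.0, 8.0],
})

# Named aggregations: built-ins plus a custom function in one pass
summary = (
    orders.groupby("user_id")
    .agg(
        total_spend=("amount", "sum"),
        order_count=("amount", "size"),
        avg_order=("amount", "mean"),
        spend_range=("amount", lambda s: s.max() - s.min()),  # custom aggregation
    )
    .reset_index()  # keep user_id as a column, not the index
)
print(summary)
```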

Milestone: EDA notebook with documented cleaning + visualizations.


Week 6: Product Metrics & Case Studies

Goal: Think like a product analyst; answer “why” and “what next.”

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 36 | Metrics Frameworks (HEART, AARRR) | Apply one framework to a product you use daily | HEART Framework |
| 37 | North Star & Input Metrics | Define for a hypothetical feature launch | Own notes |
| 38 | Case Study Walk-Through | Crack one open-ended case: “DAU dropped 5%—investigate” | Handbook: Analytical Execution |
| 39 | Case Study Walk-Through | Crack another: “Should we launch X feature internationally?” | Exponent Course |
| 40 | Case Study Walk-Through | Crack a metrics definition case | StrataScratch |
| 41–42 | Catch-up / Mini-project | Write a 1-page case study answer doc with structure | Your GitHub |

🆕 Metric Trees (Hierarchical Metric Visualization)

Metric Trees help you visualize how metrics relate to each other hierarchically. This is essential for:

  • Understanding which input metrics drive your North Star
  • Diagnosing drops (which branch of the tree is affected?)
  • Communicating metric strategy to stakeholders

Example: E-Commerce Revenue Metric Tree

                    ┌─────────────────────────────────────┐
                    │        NORTH STAR: Revenue          │
                    │         = Orders × AOV              │
                    └─────────────────────────────────────┘
                                      │
            ┌─────────────────────────┴─────────────────────────┐
            ▼                                                   ▼
    ┌───────────────────┐                           ┌───────────────────┐
    │      Orders       │                           │ Average Order Val │
    │ = Visitors × CVR  │                           │   = Revenue/Orders│
    └───────────────────┘                           └───────────────────┘
            │                                               │
    ┌───────┴───────┐                               ┌───────┴───────┐
    ▼               ▼                               ▼               ▼
┌─────────┐   ┌─────────┐                     ┌─────────┐   ┌─────────┐
│Visitors │   │   CVR   │                     │Units per│   │Unit     │
│         │   │         │                     │  Order  │   │ Price   │
└─────────┘   └─────────┘                     └─────────┘   └─────────┘
    │               │
    ▼               ▼
┌─────────────┐ ┌─────────────┐
│Organic      │ │Add-to-Cart  │
│Paid         │ │Cart→Checkout│
│Direct       │ │Checkout→Pay │
└─────────────┘ └─────────────┘

How to Use Metric Trees:

  1. Start at the top with your North Star (Revenue, DAU, etc.)
  2. Decompose mathematically — use multiplications and additions
  3. Go 3-4 levels deep until you reach actionable levers
  4. Color-code for diagnosis — red for declining, green for improving

Tools for Creating Metric Trees:

  • Miro — drag-and-drop diagrams
  • Excalidraw — hand-drawn style, great for presentations
  • Whimsical — clean flowcharts
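
Once the tree is sketched, you can quantify how much each driver contributed to a change. The sketch below is a minimal Python example with made-up numbers for the Revenue = Visitors × CVR × AOV branch of the tree above; for a multiplicative metric, log-ratios attribute the change exactly:

```python
import math

# Hypothetical weekly inputs for Revenue = Visitors x CVR x AOV
last_week = {"visitors": 100_000, "cvr": 0.030, "aov": 52.0}
this_week = {"visitors": 96_000, "cvr": 0.029, "aov": 53.0}

def revenue(d):
    return d["visitors"] * d["cvr"] * d["aov"]

# Log of the overall ratio; assumes revenue actually changed (non-zero log)
total_change = math.log(revenue(this_week) / revenue(last_week))

print(f"Revenue change: {revenue(this_week) / revenue(last_week) - 1:+.1%}")
for factor in ("visitors", "cvr", "aov"):
    # Share of the total log-change explained by this factor
    contribution = math.log(this_week[factor] / last_week[factor]) / total_change
    print(f"{factor:>9}: {this_week[factor] / last_week[factor] - 1:+.1%} "
          f"(~{contribution:.0%} of the total change)")
```

Factors moving against the overall trend (here, a rising AOV during a revenue decline) show up as negative contributions, which is exactly the signal you want when diagnosing a drop.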

🤖 AI-Augmented Challenge (Week 6): Simulated Stakeholder Bot

Practice Case Study Defense:

  1. Write your case study answer for “DAU dropped 5%”
  2. Submit it to ChatGPT with this prompt:

    “You are a skeptical VP of Product. I’m presenting my analysis of why DAU dropped 5%. Challenge my findings with tough questions like: ‘But how do you know it’s not just seasonality?’ or ‘What’s the confidence level on that conclusion?’ Be specific and push back on weak logic.”

  3. Defend your analysis in a back-and-forth conversation
  4. Reflection: What gaps in your analysis did the “VP” expose?

🧠 Week 6 Challenge Problems

**Challenge 1: DAU Dropped 5% — Full Investigation Framework**

**Scenario:** You're the data scientist at a social media app. The CEO pings you at 9 AM: *"DAU is down 5% compared to last week. What's happening?"*

**Your Task:** Walk through a complete investigation framework.

✅ Complete Framework ### Step 1: Clarify & Don't Panic (2 minutes) **Questions to ask:** - Is this 5% week-over-week or compared to the same day last week? - Is this statistically significant given normal variance? - When did it start? (Today? Gradual over the week?) **Quick sanity checks:** - Is it a holiday or unusual day? - Any major external events? (competitor launch, news) - Were there any product deployments? ### Step 2: Decomposition Framework (MECE Breakdown) **Split DAU into components:** ``` DAU = New Users + Returning Users Or: DAU = Σ (Users by Platform) = iOS + Android + Web Or: DAU = Σ (Users by Country) = US + UK + India + ... Or: DAU = Σ (Users by Acquisition Source) = Organic + Paid + Referral ``` **Check each slice:** ```sql SELECT activity_date, user_type, -- 'new' vs 'returning' COUNT(DISTINCT user_id) AS dau FROM daily_activity WHERE activity_date BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY) AND CURRENT_DATE GROUP BY 1, 2 ORDER BY 1, 2; ``` ### Step 3: Hypothesis Tree | Hypothesis | How to Check | What Would Confirm It | |------------|--------------|----------------------| | **Supply-side: Less content** | Posts created per day | Down more than DAU | | **Demand-side: Less engagement** | Sessions per user | Flat or up (not the issue) | | **Technical: App issues** | Crash rate, error logs | Spike correlates with DAU drop | | **External: Competition** | App store rankings, competitor news | Competitor launched/promoted | | **Seasonal: Natural variance** | Same period last year | Similar pattern historically | | **Acquisition: Fewer new users** | New signups, ad spend | Marketing budget cut or CAC spike | | **Retention: Users leaving** | D1, D7 retention rates | Drop in recent cohorts | ### Step 4: Drill Down (Example Findings) **Finding:** The drop is entirely in Android, specifically in India. **Next questions:** - Was there an Android-only deployment? - Did Google Play feature placement change? - Any carrier/ISP issues in India? - Was there an Android update that broke something? ### Step 5: Structure Your Answer **For the CEO (30-second version):** *"DAU is down 5% driven entirely by Android users in India. We deployed an update Tuesday that increased app size by 40MB, which is causing download/update failures on lower-end devices. Engineering is rolling back now; we expect recovery in 24-48 hours."* **For a detailed writeup:** 1. Summary of impact (quantify) 2. Root cause with supporting data 3. Immediate actions taken 4. Longer-term prevention steps 5. How we'll monitor recovery

**Challenge 2: Define Metrics for a New Feature (Product Sense)**

**Scenario:** Instagram is launching "Close Friends Story Reactions" — when you react to a close friend's story, they see a special animation. You're asked to define success metrics.

**Your Task:** Define a complete metrics framework.

✅ Complete Metrics Framework ### North Star Metric **Close Friends Interactions per User per Week** Why: This measures the core behavior we're trying to drive — deeper connections with close friends. ### Primary Metrics (Success Criteria) | Metric | Definition | Target | Rationale | |--------|------------|--------|-----------| | Close Friends Reaction Rate | (Reactions to CF stories) / (Views of CF stories) | +10% vs baseline | Direct feature usage | | CF Story Replies | DMs sent in response to CF stories | +5% | Reactions should spark conversation | ### Secondary Metrics (Mechanism Validation) | Metric | What It Tells Us | |--------|------------------| | CF Story Views per User | Are people watching more CF content? | | Time to First Reaction | Is the new animation discoverable? | | Reaction Type Distribution | Which animations are most popular? | | CF List Size | Are users adding more close friends? | ### Guardrail Metrics (Watch for Harm) | Guardrail | Concern | Threshold | |-----------|---------|-----------| | Total Story Views | CF reactions cannibalize regular stories | No decrease | | DAU | Feature is annoying/spammy | No decrease | | CF Removal Rate | Users remove friends who over-react | No increase | | Story Creation Rate | Creators feel pressured by reactions | No decrease | ### Segmentation Plan Analyze all metrics by: - New vs returning users - Heavy vs light CF users - iOS vs Android - Market (US, India, Brazil, etc.) ### Counter-metrics to Watch **If Close Friends reactions go UP but these go DOWN, we may have a problem:** - Regular story reactions (cannibalization) - DM conversations (reactions replace actual conversation) - Close Friends story creation (creators feel pressure)

**Challenge 3: Launch Decision Framework**

**Scenario:** You ran a test for a new checkout flow. Results:

| Metric | Control | Treatment | Lift | p-value |
|--------|---------|-----------|------|---------|
| Conversion Rate | 3.2% | 3.4% | +6.2% | 0.03 |
| Revenue per Visitor | $4.50 | $4.45 | -1.1% | 0.42 |
| Customer Service Tickets | 0.8% | 1.2% | +50% | 0.001 |
| Page Load Time | 2.1s | 2.3s | +10% | 0.02 |

**Question:** Should we launch? Present your recommendation.

✅ Structured Recommendation ### Executive Summary: **Do Not Launch (Yet)** ### Analysis **The Good:** - Conversion rate increased 6.2% (statistically significant) - This would translate to ~$X additional revenue per month **The Concerning:** - Revenue per visitor is flat/slightly down (not significant, but directionally negative) - **Customer service tickets up 50%** — this is a major red flag - Page load time increased 10% — impacts SEO and user experience ### Diagnosis The conversion increase paired with flat revenue and higher support tickets suggests: - More users are completing checkout, but... - They may be confused about what they purchased, OR - They're encountering issues post-purchase, OR - The new flow attracts lower-value conversions ### Recommendation 1. **Investigate support ticket content** — What are users complaining about? 2. **Analyze conversion quality** — Are these incremental converters or just shifted from returning later? 3. **Fix page load regression** — This is likely addressable 4. **Consider partial rollout** — Launch to power users first, gather qualitative feedback ### Decision Framework | Scenario | Action | |----------|--------| | Tickets are about one fixable issue | Fix it, re-run for 1 week | | Tickets show fundamental confusion | Redesign UX, full re-test | | Tickets are about external factors | Address, then proceed with launch | | Revenue per visitor drops significantly | Kill the test | ### Communication *"The test showed promising conversion gains, but support ticket volume increased 50%. Before launching, I recommend we analyze ticket content to understand the root cause. If it's a fixable UX issue, we can iterate quickly and capture the conversion gains without degrading support costs."*

**Challenge 4: "Metric Went Up — Is It Good?" (Nuance Exercise)**

**For each scenario, is the metric increase good, bad, or "it depends"?**

| Metric | Change | Good/Bad/Depends? |
|--------|--------|-------------------|
| Time spent in app | +15% | ? |
| Customer service ticket resolution time | +20% | ? |
| Number of A/B tests run | +50% | ? |
| DAU / MAU ratio | +10% | ? |
| Signup conversion rate | +25% | ? |

✅ Answers with Reasoning | Metric | Change | Verdict | Reasoning | |--------|--------|---------|-----------| | Time spent in app | +15% | **Depends** | Good if users are engaged. Bad if they're confused/stuck. Check if it correlates with satisfaction/NPS. | | CS ticket resolution time | +20% | **Bad** | Longer resolution = worse customer experience. Unless tickets are more complex (need more data). | | Number of A/B tests run | +50% | **Depends** | Good if test quality is maintained. Bad if it's "test more" without learning more. | | DAU/MAU ratio | +10% | **Good** | Higher stickiness — users engage more frequently. Classic engagement metric. | | Signup conversion | +25% | **Depends** | Good if quality maintained. Check D1, D7 retention of new users. May have lowered friction too much (low-quality signups). | **Meta-lesson:** Almost every metric is "it depends." Good analysts ask: "What behavior does this metric incentivize? What would game it?"

💬 Week 6 Discussion Prompts

  1. “Share your ‘DAU dropped’ investigation” — Post your decomposition approach.
  2. “What’s the most misleading metric you’ve encountered?” — Discuss Goodhart’s Law in practice.
  3. “Present your case study write-up” — Get feedback on structure and communication.

⚠️ Common Week 6 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Jumping to solutions before diagnosing | You might solve the wrong problem | Always decompose first (MECE) |
| Only looking at one metric | You miss cannibalization and side effects | Always check guardrails and counter-metrics |
| Assuming more = better | Time spent ≠ value; clicks ≠ satisfaction | Understand the behavior behind the metric |
| Not quantifying impact | “It’s down” vs “It’s down $50K/day” | Always translate to business impact |
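
On the last point, quantifying impact is usually one line of arithmetic. A minimal sketch with entirely made-up numbers (swap in your own DAU and ARPDAU):

```python
# Hypothetical inputs; replace with your own figures
dau_baseline = 2_000_000   # daily active users before the drop
dau_drop_pct = 0.05        # the 5% decline
arpdau = 0.12              # average revenue per DAU, in dollars

lost_users_per_day = dau_baseline * dau_drop_pct
lost_revenue_per_day = lost_users_per_day * arpdau

print(f"~{lost_users_per_day:,.0f} fewer daily actives")
print(f"~${lost_revenue_per_day:,.0f}/day (~${lost_revenue_per_day * 30:,.0f}/month) at risk")
```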

✅ Week 6 Self-Assessment

Before moving to Week 7, can you:

  • Decompose a metric drop into MECE components?
  • Define a North Star metric with supporting primary, secondary, and guardrail metrics?
  • Structure a go/no-go launch recommendation with data?
  • Identify when a metric increase might actually be bad?

Milestone: 3 case studies solved; 1 written doc.


Week 7: Analytical Engineering Basics

Goal: Understand how analysts and AEs model data for reliability. New for 2026: Get hands-on with dbt Core.

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 43 | Raw → Staging → Mart Layers | Read the AE index page; sketch layers for an orders pipeline | Handbook: Analytical Engineering |
| 44 | De-duplication Patterns | Write a ROW_NUMBER dedupe query; explain tie-breakers | Handbook: Advanced SQL |
| 45 | Idempotent Loads (ON CONFLICT) | Write an upsert in Postgres | Handbook: Advanced SQL |
| 46 | Data Quality Checks | Write 3 test queries (uniqueness, not-null, referential) | Handbook: AE Checks |
| 47 | Date Spines & Completeness | Generate a date spine; LEFT JOIN to fill gaps | Handbook: Date Spine |
| 48–49 | Catch-up / Mini-project | Commit a small pipeline (staging view + mart table + test queries) | Your GitHub |
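
Day 47's date spine also has a pandas equivalent, handy when you're validating a mart inside a notebook. A minimal sketch with made-up data (the warehouse-side SQL version appears in the Week 7 challenge solutions below):

```python
import pandas as pd

# Hypothetical daily revenue with a missing day (no orders on Jan 3)
daily = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-04"]),
    "revenue": [120.0, 95.0, 140.0],
})

# Build a complete date spine, then left-join the facts onto it
spine = pd.DataFrame({
    "date": pd.date_range(daily["date"].min(), daily["date"].max(), freq="D")
})
complete = spine.merge(daily, on="date", how="left").fillna({"revenue": 0})

print(complete)  # Jan 3 now appears with revenue 0 instead of silently missing
```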

🆕 dbt Core Tutorial (Industry-Standard Transformation Tool)

dbt (data build tool) is the industry standard for transforming data in the warehouse. Even if you don’t become an Analytics Engineer, understanding dbt is valuable for:

  • Reading and contributing to data models at your company
  • Understanding how modern data teams operate
  • Speaking the same language as your AE counterparts

Quick Start with dbt + DuckDB (No Cloud Required):

```bash
# Install dbt with the DuckDB adapter
pip install dbt-core dbt-duckdb

# Initialize a new dbt project
dbt init my_analytics_project
cd my_analytics_project
```

Project Structure:

```
my_analytics_project/
├── models/
│   ├── staging/           # Clean, rename, dedupe raw data
│   │   ├── stg_orders.sql
│   │   ├── stg_products.sql
│   │   └── _staging.yml   # Schema tests
│   ├── intermediate/      # Business logic, joins
│   │   └── int_order_items_enriched.sql
│   └── marts/             # Final aggregations
│       └── mart_daily_revenue.sql
├── tests/                 # Custom data tests
├── macros/                # Reusable SQL snippets
├── dbt_project.yml        # Project configuration
└── profiles.yml           # Database connections
```

Example Staging Model (models/staging/stg_orders.sql):

```sql
-- stg_orders.sql
-- This model deduplicates raw orders and standardizes column types

with source as (
    -- The source name is an example; point this at your own sources.yml entry
    select * from {{ source('raw', 'raw_orders') }}
),

deduplicated as (
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as row_num
    from source
)

select
    order_id,
    customer_id,
    order_date::date as order_date,
    amount::numeric(12,2) as amount,
    status
from deduplicated
where row_num = 1
  and status != 'cancelled'
```
Add Tests (models/staging/_staging.yml):

```yaml
version: 2

models:
  - name: stg_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
```

Run dbt:

```bash
# Build all models
dbt run

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve  # Opens in browser at localhost:8080
```

🤖 AI-Augmented Challenge (Week 7)

dbt Model Generation with AI:

  1. Describe your data source to ChatGPT: “I have a raw_orders table with columns (order_id, customer_id, order_date, amount, status, _loaded_at). Generate a dbt staging model that deduplicates by order_id, filters cancelled orders, and casts types appropriately.”
  2. Review the generated SQL — does it follow dbt best practices?
  3. Add tests and run dbt test to validate
  4. Reflection: What did the AI get right? What did you need to customize?

🧠 Week 7 Challenge Problems

**Challenge 1: Design a Data Pipeline (Whiteboard Exercise)**

**Scenario:** You have three source tables from your e-commerce platform:

- `raw_orders` — every order placed (may have duplicates from retry logic)
- `raw_order_items` — line items within orders
- `raw_products` — product catalog

**Design a pipeline to create a `mart_daily_revenue` table showing daily revenue by product category.**

✅ Complete Pipeline Design ### Layer Architecture ``` RAW (source of truth, append-only) ├── raw_orders ├── raw_order_items └── raw_products ↓ STAGING (cleaned, deduplicated, typed) ├── stg_orders ├── stg_order_items └── stg_products ↓ INTERMEDIATE (business logic applied) ├── int_order_items_enriched (joined with product info) ↓ MARTS (aggregated for consumption) └── mart_daily_revenue ``` ### SQL Implementation **Step 1: Staging — Deduplicate Orders** ```sql CREATE OR REPLACE VIEW stg_orders AS WITH deduplicated AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY order_id ORDER BY _loaded_at DESC -- Keep most recent if duplicated ) AS rn FROM raw_orders WHERE order_status != 'cancelled' -- Exclude cancelled ) SELECT order_id, customer_id, order_date::date AS order_date, order_total::numeric(12,2) AS order_total, _loaded_at FROM deduplicated WHERE rn = 1; ``` **Step 2: Staging — Clean Order Items** ```sql CREATE OR REPLACE VIEW stg_order_items AS SELECT order_item_id, order_id, product_id, quantity::int AS quantity, unit_price::numeric(10,2) AS unit_price, (quantity * unit_price)::numeric(12,2) AS line_total FROM raw_order_items WHERE quantity > 0 -- Exclude invalid records AND unit_price > 0; ``` **Step 3: Staging — Products** ```sql CREATE OR REPLACE VIEW stg_products AS SELECT product_id, product_name, COALESCE(category, 'Uncategorized') AS category, COALESCE(subcategory, 'Other') AS subcategory FROM raw_products; ``` **Step 4: Intermediate — Enrich Order Items** ```sql CREATE OR REPLACE VIEW int_order_items_enriched AS SELECT oi.order_item_id, oi.order_id, o.order_date, o.customer_id, oi.product_id, p.product_name, p.category, p.subcategory, oi.quantity, oi.unit_price, oi.line_total FROM stg_order_items oi JOIN stg_orders o ON oi.order_id = o.order_id JOIN stg_products p ON oi.product_id = p.product_id; ``` **Step 5: Mart — Daily Revenue** ```sql CREATE TABLE mart_daily_revenue AS WITH date_spine AS ( SELECT generate_series( (SELECT MIN(order_date) FROM stg_orders), CURRENT_DATE, '1 day'::interval )::date AS dt ), revenue_by_day_category AS ( SELECT order_date, category, SUM(line_total) AS revenue, COUNT(DISTINCT order_id) AS orders, SUM(quantity) AS units_sold FROM int_order_items_enriched GROUP BY order_date, category ) SELECT ds.dt AS date, COALESCE(r.category, 'No Sales') AS category, COALESCE(r.revenue, 0) AS revenue, COALESCE(r.orders, 0) AS orders, COALESCE(r.units_sold, 0) AS units_sold FROM date_spine ds LEFT JOIN revenue_by_day_category r ON ds.dt = r.order_date ORDER BY ds.dt, r.category; ``` ### Why This Design? | Decision | Reason | |----------|--------| | Views for staging | Easy to update; no data duplication | | Table for mart | Performance for dashboard queries | | Date spine | Ensures no gaps for time series charts | | Explicit COALESCE | Documents handling of missing data | | ROW_NUMBER for dedupe | Deterministic (vs DISTINCT which is arbitrary) |

**Challenge 2: Write Data Quality Tests**

**Problem:** Write SQL tests that would catch these data issues:

1. Duplicate order IDs
2. NULL customer IDs on orders
3. Orders referencing non-existent products
4. Negative quantities
5. Orders dated in the future

✅ Test Queries ```sql -- Test 1: No duplicate order IDs in staging -- Expected: 0 rows SELECT order_id, COUNT(*) AS duplicates FROM stg_orders GROUP BY order_id HAVING COUNT(*) > 1; -- Test 2: No NULL customer IDs -- Expected: 0 rows SELECT COUNT(*) AS null_customers FROM stg_orders WHERE customer_id IS NULL; -- Test 3: Referential integrity — all products exist -- Expected: 0 rows SELECT oi.product_id, COUNT(*) AS orphan_count FROM stg_order_items oi LEFT JOIN stg_products p ON oi.product_id = p.product_id WHERE p.product_id IS NULL GROUP BY oi.product_id; -- Test 4: No negative quantities -- Expected: 0 rows SELECT COUNT(*) AS negative_qty FROM stg_order_items WHERE quantity < 0; -- Test 5: No future orders -- Expected: 0 rows SELECT COUNT(*) AS future_orders FROM stg_orders WHERE order_date > CURRENT_DATE; -- Bonus: Aggregate test — revenue should match -- Check that mart totals equal source totals WITH mart_total AS ( SELECT SUM(revenue) AS mart_revenue FROM mart_daily_revenue ), source_total AS ( SELECT SUM(line_total) AS source_revenue FROM int_order_items_enriched ) SELECT m.mart_revenue, s.source_revenue, ABS(m.mart_revenue - s.source_revenue) AS difference FROM mart_total m, source_total s WHERE ABS(m.mart_revenue - s.source_revenue) > 0.01; -- Tolerance for rounding ``` ### dbt-Style Test YAML If using dbt, these translate to: ```yaml models: - name: stg_orders columns: - name: order_id tests: - unique - not_null - name: customer_id tests: - not_null - name: order_date tests: - not_null - dbt_utils.accepted_range: max_value: "current_date" - name: stg_order_items columns: - name: product_id tests: - relationships: to: ref('stg_products') field: product_id - name: quantity tests: - dbt_utils.accepted_range: min_value: 0 ```

**Challenge 3: Handle Late-Arriving Data**

**Problem:** Your pipeline runs daily at 6 AM. But some orders from yesterday arrive at 7 AM (late-arriving data). How do you handle this?

**Approaches to consider:**

1. Re-run the entire pipeline
2. Use incremental processing
3. Use event time vs processing time

✅ Solution Patterns ### Pattern 1: Full Refresh (Simple but Expensive) ```sql -- Just rebuild the whole mart TRUNCATE mart_daily_revenue; INSERT INTO mart_daily_revenue SELECT ... FROM int_order_items_enriched ...; ``` **Pros:** Simple, always correct **Cons:** Slow for large tables; wasteful ### Pattern 2: Incremental with Lookback Window ```sql -- Delete and replace recent data (e.g., last 3 days) DELETE FROM mart_daily_revenue WHERE date >= CURRENT_DATE - INTERVAL '3 days'; INSERT INTO mart_daily_revenue SELECT ... FROM int_order_items_enriched WHERE order_date >= CURRENT_DATE - INTERVAL '3 days'; ``` **Pros:** Handles late-arriving data within window **Cons:** Need to choose appropriate window size ### Pattern 3: Upsert (Postgres ON CONFLICT) ```sql -- Add unique constraint ALTER TABLE mart_daily_revenue ADD CONSTRAINT unique_date_category UNIQUE (date, category); -- Upsert INSERT INTO mart_daily_revenue (date, category, revenue, orders, units_sold) SELECT order_date, category, SUM(line_total), COUNT(DISTINCT order_id), SUM(quantity) FROM int_order_items_enriched WHERE order_date >= CURRENT_DATE - INTERVAL '3 days' GROUP BY order_date, category ON CONFLICT (date, category) DO UPDATE SET revenue = EXCLUDED.revenue, orders = EXCLUDED.orders, units_sold = EXCLUDED.units_sold, updated_at = NOW(); ``` **Pros:** Only touches changed rows; idempotent **Cons:** Requires unique constraint; more complex ### Pattern 4: Event Time Partitioning ```sql -- Partition by order_date (event time), not load_date (processing time) CREATE TABLE mart_daily_revenue ( date DATE, category TEXT, revenue NUMERIC, ... ) PARTITION BY RANGE (date); -- Replace only affected partitions ALTER TABLE mart_daily_revenue DETACH PARTITION mart_daily_revenue_2025_12; -- Rebuild that month CREATE TABLE mart_daily_revenue_2025_12 AS ...; ALTER TABLE mart_daily_revenue ATTACH PARTITION mart_daily_revenue_2025_12 FOR VALUES FROM ('2025-12-01') TO ('2026-01-01'); ``` **Pros:** Efficient for large tables; atomic partition swaps **Cons:** Operational complexity ### Decision Matrix | Data Volume | Lateness Tolerance | Recommended Pattern | |-------------|-------------------|---------------------| | Small (<1M rows) | Hours | Full refresh | | Medium | Days | Incremental + lookback | | Large | Days | Upsert or partitioning | | Very large | Weeks | Partitioning + selective rebuild |

**Challenge 4: Debug a Broken Pipeline**

**Scenario:** Your daily revenue mart is showing $0 for yesterday. The source tables have data. What do you check?

✅ Debugging Checklist ### Immediate Checks (< 5 minutes) ```sql -- 1. Is there data in source for yesterday? SELECT COUNT(*) FROM raw_orders WHERE order_date::date = CURRENT_DATE - 1; -- 2. Did staging dedupe remove everything? SELECT COUNT(*) FROM stg_orders WHERE order_date = CURRENT_DATE - 1; -- 3. Is there a JOIN failure (orphan records)? SELECT COUNT(*) FROM stg_order_items oi LEFT JOIN stg_orders o ON oi.order_id = o.order_id WHERE o.order_id IS NULL; -- 4. Date/timezone issue? SELECT DISTINCT order_date, order_date AT TIME ZONE 'UTC' AT TIME ZONE 'America/Los_Angeles' AS pst_date FROM stg_orders ORDER BY order_date DESC LIMIT 10; -- 5. Was the mart actually updated? SELECT MAX(updated_at) FROM mart_daily_revenue; ``` ### Common Root Causes | Symptom | Likely Cause | Fix | |---------|--------------|-----| | Raw has data, staging is empty | WHERE clause too restrictive | Check staging view filters | | Staging has data, mart is 0 | JOIN condition mismatch | Check key column types/values | | Only yesterday is 0 | Pipeline didn't run | Check scheduler logs | | All dates are wrong | Timezone conversion issue | Ensure consistent timezone handling | | Partial data missing | Late-arriving data | Extend lookback window | ### Prevention: Add Pipeline Monitoring ```sql -- Freshness check SELECT MAX(order_date) AS latest_order, CURRENT_DATE - MAX(order_date) AS days_behind FROM mart_daily_revenue; -- Expect: days_behind should be 0 or 1 -- Volume anomaly check SELECT date, revenue, AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS trailing_avg, revenue / NULLIF(AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING), 0) AS ratio FROM mart_daily_revenue WHERE date >= CURRENT_DATE - 14 ORDER BY date DESC; -- Expect: ratio should be between 0.5 and 2.0 (alert if outside) ```

💬 Week 7 Discussion Prompts

  1. “Share your staging → mart pipeline” — Get feedback on design decisions.
  2. “What’s the worst data quality bug you’ve encountered?” — Learn from real-world failures.
  3. “dbt vs raw SQL: what’s your preference?” — Discuss tooling trade-offs.

⚠️ Common Week 7 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Not testing for referential integrity | Silent data loss on JOINs | Add LEFT JOIN checks; count orphan records |
| Using SELECT * in views | Schema changes break downstream models | Explicitly list columns |
| No date spine | Gaps in time series break charts | Always LEFT JOIN from a complete date range |
| Hardcoded dates | The pipeline breaks tomorrow | Use CURRENT_DATE and relative intervals |
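
The referential-integrity check in the first row is easy to reproduce in a notebook with a pandas anti-join. A minimal sketch with made-up tables:

```python
import pandas as pd

# Hypothetical staged tables
order_items = pd.DataFrame({"order_item_id": [1, 2, 3], "product_id": [10, 11, 99]})
products = pd.DataFrame({"product_id": [10, 11, 12]})

# Anti-join: find order items whose product_id has no match in products
checked = order_items.merge(products, on="product_id", how="left", indicator=True)
orphans = checked[checked["_merge"] == "left_only"]

print(f"Orphan rows: {len(orphans)}")  # 1 here: product_id 99 has no match
# In a pipeline you would fail loudly instead:
# assert orphans.empty, f"{len(orphans)} order items reference missing products"
```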

✅ Week 7 Self-Assessment

Before moving to Week 8, can you:

  • Design a raw → staging → mart pipeline on a whiteboard?
  • Write a ROW_NUMBER dedupe query with explicit tie-breakers?
  • Create data quality tests for uniqueness, nulls, and referential integrity?
  • Handle late-arriving data with an incremental load pattern?
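
For the last item, here is a minimal pandas simulation of the delete-and-reload lookback pattern with made-up data; in production the same logic would typically run as SQL against the warehouse:

```python
import pandas as pd

LOOKBACK_DAYS = 3
today = pd.Timestamp("2026-01-10")
cutoff = today - pd.Timedelta(days=LOOKBACK_DAYS)

# Hypothetical existing mart and freshly loaded source rows
mart = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-06", "2026-01-07", "2026-01-08", "2026-01-09"]),
    "revenue": [100.0, 110.0, 90.0, 50.0],   # Jan 9 was incomplete at load time
})
source = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-07", "2026-01-08", "2026-01-09",
                            "2026-01-09", "2026-01-10"]),
    "amount": [110.0, 90.0, 50.0, 40.0, 70.0],  # includes a late-arriving Jan 9 order
})

# 1) Drop mart rows inside the lookback window
kept = mart[mart["date"] < cutoff]
# 2) Recompute the window from source (late rows are now included)
recomputed = (
    source[source["date"] >= cutoff]
    .groupby("date", as_index=False)["amount"].sum()
    .rename(columns={"amount": "revenue"})
)
# 3) Stitch back together; re-running this yields the same result (idempotent)
mart = pd.concat([kept, recomputed]).sort_values("date").reset_index(drop=True)
print(mart)
```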

Milestone: Working staging → mart pipeline with tests.


Week 8: Capstone & Storytelling

Goal: Synthesize everything into a portfolio-quality deliverable. New for 2026: Required 3-minute video walkthrough.

| Day | Focus | Do This | Resource |
|-----|-------|---------|----------|
| 50 | Choose Capstone Topic | Pick a question you can answer with a public dataset | Kaggle, Census, Google Trends |
| 51 | Data Prep & EDA | Clean and explore; note 3 hypotheses | Your notebook |
| 52 | Analysis & Insight | Run stats/SQL/Python; validate or reject hypotheses | Your notebook |
| 53 | Build Presentation | Create 5 slides: Problem → Approach → Key Insight → Recommendation → Next Steps | Google Slides / Canva |
| 54 | Record Video Walkthrough | Record a 3-minute Loom explaining your project (required!) | Loom |
| 55–56 | Polish & Publish | Finalize notebook + slides + video; post on LinkedIn/GitHub | Your portfolio |

🆕 Video Walkthrough Requirement (Communication is 50% of the Job)

Why This Matters:

  • In real jobs, you’ll present findings to stakeholders who won’t read your notebook
  • Video communication is increasingly expected (remote work, async updates)
  • This is your chance to practice explaining technical work to non-technical audiences

3-Minute Video Structure:

  1. 0:00-0:30 — The Hook: What question did you answer? Why does it matter?
  2. 0:30-1:30 — The Method: How did you approach it? (Keep it high-level, avoid code details)
  3. 1:30-2:30 — The Insight: What did you find? Show your key visualization.
  4. 2:30-3:00 — The Recommendation: What should the business do? What’s the next step?

Tools for Recording:

  • Loom — Free, easy screen + camera recording
  • OBS Studio — Free, more control (local recording)
  • Zoom — Record yourself presenting

Pro Tips:

  • Write a script first (or at least bullet points)
  • Practice once before recording
  • It’s okay to do multiple takes!
  • Include your face in the corner — builds trust

🤖 AI-Augmented Challenge (Week 8)

Presentation Critique with AI:

  1. Write your 5-slide presentation outline
  2. Submit it to ChatGPT: “You are a presentation coach. Critique this data presentation structure. What’s confusing? What’s missing? How can I make the story more compelling?”
  3. Revise based on feedback
  4. Reflection: What storytelling improvements did the AI suggest?

🎯 Week 8 Capstone Framework

Your capstone should demonstrate every skill you’ve built:

| Week | Skill to Showcase | How to Demonstrate in Capstone |
|------|-------------------|--------------------------------|
| 1-2 | SQL mastery | Include a complex query (CTEs, window functions) |
| 3 | Statistics | Include a statistical test with a CI |
| 4 | A/B testing mindset | Frame a “next steps” experiment |
| 5 | Python/Pandas (or Polars!) | Show clean data pipeline code |
| 6 | Product metrics | Track a North Star metric with a metric tree |
| 7 | AE fundamentals | Document your data transformation logic |
| 8 | Communication | 3-minute video + clear slides |
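
For the Week 3 row, a statistical test with a confidence interval can be as compact as the sketch below: a two-proportion z-test with a 95% CI on the lift, using made-up counts:

```python
from scipy import stats

# Hypothetical comparison: conversions in two groups
conv_a, n_a = 420, 10_000
conv_b, n_b = 520, 10_000
p_a, p_b = conv_a / n_a, conv_b / n_b

# Two-proportion z-test (pooled standard error under the null)
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# 95% CI for the absolute difference (unpooled standard error)
se_diff = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
ci_low = (p_b - p_a) - 1.96 * se_diff
ci_high = (p_b - p_a) + 1.96 * se_diff

print(f"Lift: {p_b - p_a:+.2%} (95% CI [{ci_low:+.2%}, {ci_high:+.2%}]), p = {p_value:.4f}")
```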

🧠 Week 8 Challenge: Example Capstone Project

Full Capstone Example: "What Drives Airbnb Pricing in NYC?" ### 📋 Project Overview **Question:** What factors most influence Airbnb listing prices in NYC? **Dataset:** [Inside Airbnb NYC](http://insideairbnb.com/get-the-data/) **Skills Demonstrated:** - SQL (data exploration) - Statistics (hypothesis testing) - Python (EDA, visualization) - Product thinking (recommendations) - Storytelling (presentation) --- ### 🔍 Part 1: Data Exploration (SQL) ```sql -- Load data into Postgres and explore -- (This would be in a staging layer in a real pipeline) -- Key stats by neighborhood WITH neighborhood_stats AS ( SELECT neighbourhood_group AS borough, neighbourhood, COUNT(*) AS listings, ROUND(AVG(price), 2) AS avg_price, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price), 2) AS median_price, ROUND(AVG(availability_365), 1) AS avg_availability FROM listings WHERE price BETWEEN 20 AND 1000 -- Filter outliers GROUP BY neighbourhood_group, neighbourhood ) SELECT borough, SUM(listings) AS total_listings, ROUND(AVG(avg_price), 2) AS borough_avg_price, ROUND(AVG(median_price), 2) AS borough_median_price FROM neighborhood_stats GROUP BY borough ORDER BY borough_avg_price DESC; ``` **Key Finding:** Manhattan has 2x the median price of other boroughs. --- ### 📊 Part 2: Hypothesis Testing (Python) **Hypothesis:** Superhosts charge higher prices than regular hosts. ```python import pandas as pd from scipy import stats # Load data listings = pd.read_csv('listings.csv') # Filter to reasonable price range listings = listings[(listings['price'] >= 20) & (listings['price'] <= 1000)] # Split by superhost status superhosts = listings[listings['host_is_superhost'] == 't']['price'] regular_hosts = listings[listings['host_is_superhost'] == 'f']['price'] # Two-sample t-test t_stat, p_value = stats.ttest_ind(superhosts, regular_hosts, equal_var=False) # Effect size (Cohen's d) pooled_std = ((superhosts.std()**2 + regular_hosts.std()**2) / 2) ** 0.5 cohens_d = (superhosts.mean() - regular_hosts.mean()) / pooled_std print(f"Superhost mean: ${superhosts.mean():.2f}") print(f"Regular host mean: ${regular_hosts.mean():.2f}") print(f"Difference: ${superhosts.mean() - regular_hosts.mean():.2f}") print(f"p-value: {p_value:.4f}") print(f"Cohen's d: {cohens_d:.3f}") ``` **Result:** - Superhosts charge $12 more on average (p < 0.001) - Effect size is small (d = 0.15), but statistically significant - **Interpretation:** The premium is real but modest --- ### 📈 Part 3: Feature Importance (Python) ```python import pandas as pd import numpy as np from sklearn.ensemble import RandomForestRegressor from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt # Prepare features features = ['accommodates', 'bedrooms', 'bathrooms', 'availability_365', 'number_of_reviews', 'reviews_per_month', 'minimum_nights'] # Create dummy variables for room type listings_encoded = pd.get_dummies(listings, columns=['room_type']) X = listings_encoded[features + [c for c in listings_encoded.columns if c.startswith('room_type_')]] y = listings_encoded['price'] # Handle missing values X = X.fillna(X.median()) # Train model X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1) rf.fit(X_train, y_train) # Feature importance importance_df = pd.DataFrame({ 'feature': X.columns, 'importance': rf.feature_importances_ }).sort_values('importance', ascending=False) print("Top 5 Price Drivers:") print(importance_df.head()) ``` **Key 
Finding:** Top 3 drivers are: 1. Number of bedrooms (35% importance) 2. Accommodates (25% importance) 3. Room type = Entire home (20% importance) --- ### 🎤 Part 4: The Presentation (5 Slides) #### Slide 1: The Problem > "Airbnb hosts often misprice their listings, leaving money on the table or sitting vacant. What actually drives price?" #### Slide 2: The Approach - Dataset: 50,000+ NYC Airbnb listings - Analysis: SQL exploration → Statistical testing → Feature modeling - Validation: 80/20 train-test split #### Slide 3: Key Insight > "**Bedrooms and capacity drive 60% of pricing** — but Superhost status only adds a $12 premium despite the badge." *Include a feature importance bar chart* #### Slide 4: Recommendation > "For hosts: Focus on accurately conveying capacity. For Airbnb: Consider making Superhost benefits more visible in search." #### Slide 5: Next Steps > "Proposed experiment: A/B test showing 'Superhost' badge more prominently in search results. Hypothesis: Will increase Superhost booking rate by 5%." --- ### 📁 Project Structure ``` airbnb-price-analysis/ ├── README.md # Overview, key findings, how to run ├── data/ │ └── listings.csv # Raw data (or download script) ├── notebooks/ │ ├── 01_exploration.ipynb # SQL-like exploration in Python │ ├── 02_hypothesis_tests.ipynb │ └── 03_feature_analysis.ipynb ├── sql/ │ └── exploration_queries.sql ├── presentation/ │ └── airbnb_pricing.pdf # Final slides └── requirements.txt ```

📝 Capstone Presentation Framework

The "SCR" Framework (Situation-Complication-Resolution)

Every great analytics presentation follows this structure:

**Situation (Slide 1):** Set the stage
- What's the business context?
- Why does this matter now?
- What question are we answering?

**Complication (Slide 2-3):** The twist
- What did we find that was surprising?
- What's the tension or trade-off?
- What data supports this?

**Resolution (Slide 4-5):** The recommendation
- What should we do?
- What's the expected impact?
- What are the next steps or experiments?

### Example: "User Churn Analysis"

| Section | Content |
|---------|---------|
| **Situation** | "Our 30-day churn rate has increased from 12% to 18% over the past quarter." |
| **Complication** | "We found that 80% of churned users never completed onboarding. But surprisingly, users who skipped the tutorial had HIGHER retention than those who completed it halfway." |
| **Resolution** | "Recommendation: Shorten the tutorial from 7 steps to 3. A/B test this with 20% of new users. Expected impact: Reduce churn by 3 percentage points." |

Storytelling Anti-Patterns (What NOT to Do)

| ❌ Anti-Pattern | Why It's Bad | ✅ Do This Instead |
|-----------------|--------------|-------------------|
| "Here's a dashboard" | No narrative; the audience draws wrong conclusions | "Here's what I found and what we should do" |
| "The data shows..." (passive voice) | Weak; hides ownership | "I found..." or "We discovered..." |
| Starting with methodology | Loses the audience; no hook | Start with the insight or question |
| 10 charts, 3 insights | Overwhelms; dilutes the message | 3 charts maximum; one insight per slide |
| "It's complicated" | Sounds like an excuse | "The key factor is X, though Y also matters" |
| No recommendation | Analysis without action is useless | Always end with "We should..." |

### The "So What?" Test

For every slide, ask: "So what?"

- ❌ "Users in California have higher LTV than Texas" — So what?
- ✅ "Users in California have 40% higher LTV, which means we should increase CAC budget for CA by 20%" — Clear action!

💬 Week 8 Discussion Prompts

  1. “Share your capstone topic for feedback” — Get input before you’re too deep.
  2. “Post your presentation and ask for one improvement suggestion” — Practice receiving feedback.
  3. “What’s the hardest thing about storytelling with data?” — Share struggles and tips.

⚠️ Common Week 8 Mistakes

| Mistake | Why It’s Wrong | Fix |
|---------|----------------|-----|
| Choosing too broad a topic | Can’t complete it in a week | Narrow to one specific question |
| Spending 80% of time on data cleaning | Shows up in the presentation as rushed analysis | Timebox cleaning; use a mostly-clean dataset |
| No recommendation | Analysis without action is wasted work | Every presentation needs “We should…” |
| Burying the insight | The audience loses interest before the payoff | Lead with the headline, then support it |
| Perfect is the enemy of done | Never shipping = no portfolio piece | Ship something imperfect; iterate later |

✅ Week 8 Self-Assessment

Before claiming your Champion badge, confirm:

  • Does your capstone showcase SQL, stats, AND Python?
  • Does your presentation follow Situation → Complication → Resolution?
  • Does every slide answer “So what?”
  • Have you received feedback from at least one other person?
  • Is your work published (GitHub, LinkedIn, or personal site)?

🎯 Capstone Peer Review Checklist

When reviewing a peer’s capstone:

| Category | Question | Score (1-5) |
|----------|----------|-------------|
| Clarity | Is the question clearly stated in the first 30 seconds? | |
| Data | Is the data source credible and appropriate? | |
| Analysis | Does the analysis use appropriate methods for the question? | |
| Insight | Is the key finding surprising, important, or actionable? | |
| Recommendation | Is there a clear “We should…” statement? | |
| Storytelling | Does the narrative flow logically? | |
| Visuals | Are charts clear and labeled appropriately? | |
| Polish | Is the code clean and the notebook well-organized? | |

Feedback format: “I liked [specific thing]. One suggestion: [specific improvement].”

Milestone: Capstone notebook + slide deck published. 🎉


Tracking Your Progress

GitHub Progress Tracker

Create a public GitHub repo called 2026-analytics-challenge to track your journey:

```bash
# Create your challenge repo
mkdir 2026-analytics-challenge
cd 2026-analytics-challenge
git init
echo "# 🎯 My 2026 Analytics Challenge Journey" > README.md
echo "" >> README.md
echo "![2026 Analytics Challenge](https://img.shields.io/badge/2026_Analytics_Challenge-Participant-blue?style=for-the-badge&logo=github&logoColor=white)" >> README.md
```

Progress Checklist

Copy this to your repo’s README or a Notion page:

## 🗓️ Challenge Progress

### Week 1: SQL Fundamentals
- [ ] Day 1: SELECT, WHERE, ORDER (3 problems)
- [ ] Day 2: JOINs (3 problems)
- [ ] Day 3: GROUP BY, HAVING (3 problems)
- [ ] Day 4: Subqueries (2 problems)
- [ ] Day 5: CTEs (3 problems)
- [ ] Day 6-7: Mini-project query
- [ ] **Posted Week 1 check-in to Discussions**

### Week 2: Window Functions & CTEs
- [ ] Day 8: ROW_NUMBER, RANK (2 problems)
- [ ] Day 9: LAG/LEAD (2 problems)
- [ ] Day 10: Running totals (2 problems)
- [ ] Day 11: NTILE, PERCENT_RANK
- [ ] Day 12: Cohort retention query
- [ ] Day 13-14: Visualization
- [ ] **Posted Week 2 check-in to Discussions**

### Week 3: Statistics
- [ ] Day 15: Distributions
- [ ] Day 16: CLT simulation
- [ ] Day 17: Hypothesis testing
- [ ] Day 18: p-values & CI explainer
- [ ] Day 19: Power & sample size
- [ ] Day 20-21: Sample size script
- [ ] **Posted Week 3 check-in to Discussions**

### Week 4: A/B Testing
- [ ] Day 22: Experiment design
- [ ] Day 23: Randomization
- [ ] Day 24: Analyze mock result
- [ ] Day 25: Edge cases
- [ ] Day 26: Common pitfalls
- [ ] Day 27-28: Full experiment doc
- [ ] **Posted Week 4 check-in to Discussions**
- [ ] 🎯 **HALFWAY BADGE EARNED!**

### Week 5: Python/Pandas
- [ ] Day 29: DataFrame basics
- [ ] Day 30: Data cleaning
- [ ] Day 31: Groupby & aggregation
- [ ] Day 32: Merging & joining
- [ ] Day 33: Visualization
- [ ] Day 34-35: EDA notebook
- [ ] **Posted Week 5 check-in to Discussions**

### Week 6: Product Metrics
- [ ] Day 36: Metrics frameworks
- [ ] Day 37: North Star metrics
- [ ] Day 38: Case study #1
- [ ] Day 39: Case study #2
- [ ] Day 40: Case study #3
- [ ] Day 41-42: Written case doc
- [ ] **Posted Week 6 check-in to Discussions**

### Week 7: Analytical Engineering
- [ ] Day 43: Data modeling layers
- [ ] Day 44: De-duplication
- [ ] Day 45: Idempotent loads
- [ ] Day 46: Data quality checks
- [ ] Day 47: Date spines
- [ ] Day 48-49: Pipeline project
- [ ] **Posted Week 7 check-in to Discussions**

### Week 8: Capstone
- [ ] Day 50: Choose topic
- [ ] Day 51: Data prep & EDA
- [ ] Day 52: Analysis
- [ ] Day 53: Build presentation
- [ ] Day 54: Present/record
- [ ] Day 55-56: Polish & publish
- [ ] **Posted Capstone to Discussions**
- [ ] 🏆 **CHAMPION BADGE EARNED!**

Social Accountability

Share weekly wins on social media with #2026AnalyticsChallenge for accountability.

LinkedIn Post Template:

```
🎯 Week [X] of #2026AnalyticsChallenge complete!

This week I learned:
• [Key insight 1]
• [Key insight 2]

Check out my work: [GitHub link]

Thanks to @Data-Science-Analytical-Handbook for the structure!

#DataScience #SQL #Analytics #LearningInPublic
```

What You’ll Have by the End of Week 8

| Artifact | Purpose |
|----------|---------|
| 25+ SQL problems solved | Interview-ready speed |
| Stats & sample-size scripts | Demonstrate rigor |
| EDA + cohort notebooks | Portfolio projects |
| Experiment write-up | Show experimentation fluency |
| Staging → mart pipeline | Analytical engineering proof |
| Capstone presentation | Storytelling & executive presence |
| GitHub Challenge Badge | Visible proof of commitment |
| 3-Minute Video | Demonstrates communication skills |
| Digital Certificate | LinkedIn-ready proof of completion |

📣 Community & Marketing

Cohort Start Dates (Join a Community)

Instead of going it alone, join an official cohort for accountability and community:

| Cohort | Start Date | Registration Deadline |
|--------|------------|-----------------------|
| Winter 2026 | January 1, 2026 | December 25, 2025 |
| Spring 2026 | April 1, 2026 | March 25, 2026 |
| Fall 2026 | September 1, 2026 | August 25, 2026 |

Why Join a Cohort?

  • Weekly live AMAs (Twitter Spaces / LinkedIn Audio)
  • Dedicated Slack/Discord channel for your cohort
  • Peer accountability partners
  • Weekly “Spotlight” features for best submissions

The “Commitment Contract” (Social Accountability)

Want to 10x your chances of completing the challenge? Make a public commitment:

  1. Download this image and post it to LinkedIn/Twitter:

    🎯 I’m committed to the 2026 Analytics Challenge! For the next 8 weeks, I’ll be leveling up my SQL, Python, statistics, and storytelling skills. Follow my journey: #2026AnalyticsChallenge github.com/moshesham/Data-Science-Analytical-Handbook

  2. Star the repo ⭐ — helps others discover the challenge and unlocks bonus materials!

  3. Post weekly updates with #2026AnalyticsChallenge

Weekly Live AMAs

Every Friday during the challenge, join a 30-minute live session:

  • Platform: Twitter Spaces or LinkedIn Audio (links posted in Discussions)
  • Format: 10-min recap of the week + 20-min Q&A
  • Topics: That week’s content, common struggles, career advice

Spotlight Rewards

Each week, we feature the best submissions:

  • Best SQL Query (Week 1-2)
  • Best Statistical Explanation (Week 3)
  • Best Experiment Design (Week 4)
  • Best EDA Notebook (Week 5)
  • Best Case Study Answer (Week 6)
  • Best Pipeline (Week 7)
  • Best Capstone (Week 8)

Winners get featured in the main README and earn a special “Spotlight” badge!


🎖️ Hall of Fame

Completed the challenge? Get recognized!

  1. Post your capstone to Discussions
  2. Submit a PR adding your name to the Hall of Fame
  3. Earn your Champion badge for your profile

| Name | GitHub | Capstone Project | Completion Date |
|------|--------|------------------|-----------------|
| Your name here! | @username | Project title | Feb 2026 |

Additional Resources


🎯 Start Your Journey Today

Don't wait for January 1st—start preparing now!

📝 Join the Challenge ⭐ Star the Repo

Good luck—see you on the other side of the challenge! 🚀
