🎯 2026 New Year Resolution Challenge: Level Up Your Analytical Skills
Official Cohort Start Dates: January 1, 2026 | April 1, 2026 | September 1, 2026
Duration: 8 weeks
Commitment: ~1 hour/day on weekdays; flexible weekends for catch-up/projects
Goal: Finish your 8-week cohort with demonstrably stronger SQL, Python, statistics, product analytics, AI-augmented analysis, and storytelling skills.
💡 New for 2026: This challenge now includes an AI-Augmented Analyst Track — learn to leverage GenAI tools as a force multiplier, not a crutch.
🚀 Ready to Transform Your 2026?
Join hundreds of aspiring data professionals in this 8-week challenge.
📝 Sign Up & Introduce Yourself
🚀 Quick Start & Environment Setup
One-Click Environments (Zero Local Setup)
Skip the pip install struggles! Use these browser-based environments:
For SQL Practice (Weeks 1-2):
- Use DuckDB WASM Shell — runs SQL in your browser, no setup!
- Load our sample data:
CREATE TABLE customers AS SELECT * FROM 'https://raw.githubusercontent.com/moshesham/Data-Science-Analytical-Handbook/main/supplementary/challenge-2026/data/customers.csv';
For Python Practice (Weeks 5, 8):
- Use Google Colab — free Jupyter notebooks in the cloud
- Or GitHub Codespaces — full VS Code in browser
Local Setup (Optional)
If you prefer local development:
# Clone the repository
git clone https://github.com/moshesham/Data-Science-Analytical-Handbook.git
cd Data-Science-Analytical-Handbook/supplementary/challenge-2026
# Option 1: pip
pip install -r requirements.txt
# Option 2: conda
conda env create -f environment.yml
conda activate analytics-challenge-2026
Sample Data
We’ve included ready-to-use datasets in data/:
- `customers.csv`, `orders.csv`, `products.csv` — for SQL practice
- `user_activity.csv` — for cohort analysis
- `daily_metrics.csv`, `monthly_revenue.csv` — for time series analysis
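If you want to sanity-check the files before Day 1, here's a quick sketch (it assumes you're running from the `challenge-2026` directory and that the file names match the list above):

```python
# Quick sanity check on the bundled datasets.
import pandas as pd

for name in ["customers", "orders", "products",
             "user_activity", "daily_metrics", "monthly_revenue"]:
    df = pd.read_csv(f"data/{name}.csv")
    print(f"{name}: {len(df)} rows, columns: {list(df.columns)}")
```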
🤖 The AI-Augmented Analyst Track (NEW)
In 2026, analysts who can leverage AI tools effectively will have a massive advantage. This track teaches you to use GenAI as a force multiplier — not a replacement for critical thinking.
How It Works
Every week includes an optional AI challenge that teaches responsible AI-augmented analysis:
| Week | Traditional Skill | AI-Augmented Challenge |
|---|---|---|
| 1 | SQL Fundamentals | Prompt Engineering for SQL — Write a complex query, then ask Copilot/ChatGPT to optimize it. Compare execution plans using EXPLAIN ANALYZE. |
| 2 | Window Functions | SQL Review with AI — Generate window function queries with AI, then critique and fix the errors it makes. |
| 3 | Statistics | Synthetic Data Generation — Use Python faker or an LLM to generate a dataset with specific statistical properties. |
| 4 | A/B Testing | Experiment Design Review — Use AI to critique your experiment design doc. Challenge its suggestions. |
| 5 | Pandas/EDA | Refactoring with AI — Take your “spaghetti code” from Day 30 and use an AI agent to modularize it. |
| 6 | Product Metrics | Simulated Stakeholder — Submit your case study answer to ChatGPT playing the role of a skeptical PM. |
| 7 | Analytical Engineering | dbt Model Generation — Use AI to scaffold a dbt project, then customize and validate. |
| 8 | Capstone | AI-Assisted Storytelling — Use AI to critique your presentation structure, then improve it. |
Key Principles for AI-Augmented Analysis
- AI as Co-Pilot, Not Autopilot — Always verify AI outputs; it makes confident mistakes.
- Prompt Engineering is a Skill — Learn to write precise prompts that constrain the output.
- Domain Knowledge Matters More — AI doesn’t know your business context; you do.
- Show Your Reasoning — In interviews, explain how you validated AI-generated code.
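One concrete way to practice the first principle: before adopting AI-refactored code, run it and your original on the same data and diff the outputs. A minimal sketch; `ai_refactored_metric` is a hypothetical stand-in for whatever the AI returns:

```python
# Verify before you trust: assert the AI's version matches yours on test data.
import pandas as pd

def original_metric(df: pd.DataFrame) -> pd.Series:
    return df.groupby("region")["sales"].sum()

def ai_refactored_metric(df: pd.DataFrame) -> pd.Series:
    # Hypothetical AI-produced rewrite of the same calculation.
    return df.pivot_table(index="region", values="sales", aggfunc="sum")["sales"]

df = pd.DataFrame({"region": ["NA", "NA", "EU"], "sales": [10, 5, 7]})
pd.testing.assert_series_equal(
    original_metric(df).sort_index(),
    ai_refactored_metric(df).sort_index(),
    check_names=False,
)
print("Refactor verified: outputs match on the test data.")
```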
🏆 Earn Your Challenge Badge
Complete the challenge and showcase your achievement! Add this badge to your GitHub profile README:
Challenge Badges

- Participant Badge (Week 1+) *(badge image)*
- Halfway Hero (Week 4 Complete) *(badge image)*
- Challenge Champion (Full Completion) *(badge image)*
- Weekly Skill Badges *(badge images)*
📜 Digital Certificate
Upon completing the capstone (Week 8), you’ll receive a digital certificate to showcase on your LinkedIn profile:
- Post your capstone to the Week 8 Discussion thread
- Include these components: GitHub repo link, presentation slides/video, key insights
- Receive your certificate — we’ll generate a personalized PDF certificate for verified completions
🎓 Certificates include your name, completion date, capstone project title, and a verification link.
💬 Join the Community
Daily Check-Ins
Post your daily progress in our GitHub Discussions! Each day, share:
- ✅ What you completed
- 💡 One thing you learned
- ❓ Questions for the community
Weekly Threads
Every Monday, a new weekly thread goes live. Use it to:
- Share your mini-project
- Get feedback on your code
- Help others who are stuck
- Celebrate wins! 🎉
| Week | Discussion Thread |
|---|---|
| 1 | SQL Fundamentals Check-In |
| 2 | Window Functions Show & Tell |
| 3 | Stats Concepts Discussion |
| 4 | Experiment Design Review |
| 5 | EDA Notebook Showcase |
| 6 | Case Study Solutions |
| 7 | Pipeline Demos |
| 8 | Capstone Presentations 🎓 |
Help Others, Level Up Together
- 👍 Upvote helpful answers
- 💬 Comment with encouragement
- 🔧 Review a peer’s SQL query or notebook
- 🏅 Top contributors get featured in the repo!
Why This Challenge Works
| Principle | How It’s Applied |
|---|---|
| Consistency beats intensity | 1 focused hour daily > weekend cram sessions |
| Learn by doing | Every day has a concrete deliverable |
| Public accountability | Track progress on GitHub/LinkedIn; share wins |
| Build a portfolio | End with 4 polished mini-projects to show |
Challenge Overview
| Week | Theme | Outcome | 🆕 2026 Innovation |
|---|---|---|---|
| 1 | SQL Fundamentals Refresher | Solve 15 SQL problems, from easy → medium | Query Cost Optimization with EXPLAIN ANALYZE |
| 2 | SQL Mastery: Window Functions & CTEs | Complete 10 advanced problems; build a cohort/retention query | Visual SQL — see query results as charts |
| 3 | Statistics for Product Analytics | Explain p-values, CI, power; design a sample size calculator | Peeking Simulation — visualize why peeking is bad |
| 4 | A/B Testing & Experimentation | Design & analyze a mock experiment end-to-end | Multi-Armed Bandits — modern alternative to A/B |
| 5 | Python (Pandas) for Analysis | Clean, explore, and visualize a real dataset | Polars — high-performance DataFrame alternative |
| 6 | Product Metrics & Case Studies | Define metrics frameworks; crack 3 case studies | Metric Trees — hierarchical metric visualization |
| 7 | Analytical Engineering Basics | Build a staging → mart pipeline in SQL | dbt Core — industry-standard transformation tool |
| 8 | Capstone & Storytelling | Deliver a 5-slide insight presentation | Video Walkthrough — 3-min Loom requirement |
Week 1: SQL Fundamentals Refresher
Goal: Knock the rust off core SQL and solve problems quickly.
📂 Sample Data: Use our ready-to-use datasets in `challenge-2026/data/` — no downloading required!
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 1 | SELECT, WHERE, ORDER | Solve 3 easy SQL problems on DataLemur or LeetCode | DataLemur |
| 2 | JOINs (INNER, LEFT) | Diagram how LEFT JOIN works; solve 3 join problems | SQLBolt |
| 3 | GROUP BY, HAVING | Solve 3 aggregation problems; explain the difference from WHERE | Mode Tutorial |
| 4 | Subqueries | Rewrite a subquery as a CTE; solve 2 problems both ways | LeetCode |
| 5 | CTEs + EXPLAIN ANALYZE | Solve 3 CTE problems; learn to read execution plans | DataLemur |
| 6–7 | Catch-up / Mini-project | Write a single CTE-based query that answers a business question on a public dataset (e.g., Kaggle) | Kaggle Datasets |
🆕 Query Cost Optimization (Distinguishes Senior Analysts)
Learning to read EXPLAIN ANALYZE output early sets you apart. Here’s why it matters:
-- Run this on any query to see its execution plan
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY order_count DESC;
What to look for:
- Seq Scan vs Index Scan — Index scans are usually faster for filtered queries
- Nested Loop vs Hash Join — Hash joins are better for large tables
- Actual Time — The “actual time” tells you how long each step took
- Rows — Compare “estimated” vs “actual” rows to spot statistics issues
Pro Tip: In DuckDB (browser-based), use EXPLAIN to see the logical plan:
EXPLAIN SELECT * FROM orders WHERE amount > 100;
🤖 AI-Augmented Challenge (Week 1)
Prompt Engineering for SQL:
- Write a complex query with 3+ JOINs and aggregations
- Ask ChatGPT/Copilot: “Optimize this query for performance. Explain what you changed and why.”
- Compare the AI's version to yours using `EXPLAIN ANALYZE`
- Reflection: Did the AI improve performance? Did it break correctness?
🧠 Week 1 Challenge Problems
Challenge 1: Customer Order Analysis
**Problem:** Given tables `customers` and `orders`, find customers who placed more than 3 orders in the last 30 days but have never ordered on a weekend.

```sql
-- Schema
CREATE TABLE customers (customer_id INT, name TEXT, signup_date DATE);
CREATE TABLE orders (order_id INT, customer_id INT, order_date DATE, amount DECIMAL);

-- Sample Data
INSERT INTO customers VALUES (1, 'Alice', '2024-01-15'), (2, 'Bob', '2024-06-01');
INSERT INTO orders VALUES
  (101, 1, '2025-12-01', 50),
  (102, 1, '2025-12-05', 30),
  (103, 1, '2025-12-10', 45),
  (104, 1, '2025-12-15', 60),
  (105, 2, '2025-12-07', 100); -- Sunday (weekend)
```

**Your Task:** Write the query before looking at the solution.

✅ Solution
```sql
WITH recent_orders AS (
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        COUNT(*) FILTER (WHERE EXTRACT(DOW FROM order_date) IN (0, 6)) AS weekend_orders
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY customer_id
)
SELECT c.customer_id, c.name, ro.order_count
FROM customers c
JOIN recent_orders ro ON c.customer_id = ro.customer_id
WHERE ro.order_count > 3
  AND ro.weekend_orders = 0;
```

**Key Concepts:** FILTER clause, EXTRACT(DOW), date arithmetic, CTE for readability.

Challenge 2: Revenue Gap Analysis
**Problem:** Find all dates in December 2025 where no orders were placed. Return the date and which day of the week it was.

**Your Task:** This requires generating a date series. Think about how to create dates that don't exist in your data.

✅ Solution
```sql
WITH date_spine AS (
    SELECT generate_series('2025-12-01'::date, '2025-12-31'::date, '1 day')::date AS dt
),
order_dates AS (
    SELECT DISTINCT order_date
    FROM orders
    WHERE order_date BETWEEN '2025-12-01' AND '2025-12-31'
)
SELECT
    ds.dt AS missing_date,
    TO_CHAR(ds.dt, 'Day') AS day_of_week
FROM date_spine ds
LEFT JOIN order_dates od ON ds.dt = od.order_date
WHERE od.order_date IS NULL
ORDER BY ds.dt;
```

**Key Concepts:** generate_series (Postgres), date spine pattern, LEFT JOIN to find gaps.

Challenge 3: Self-Join for User Referrals
**Problem:** Table `users` has columns (user_id, name, referred_by_user_id). Find all users who referred at least 2 other users, along with the names of who they referred.

✅ Solution
```sql
WITH referral_counts AS (
    SELECT referred_by_user_id AS referrer_id, COUNT(*) AS referral_count
    FROM users
    WHERE referred_by_user_id IS NOT NULL
    GROUP BY referred_by_user_id
    HAVING COUNT(*) >= 2
)
SELECT
    referrer.name AS referrer_name,
    STRING_AGG(referred.name, ', ' ORDER BY referred.name) AS referred_users,
    rc.referral_count
FROM referral_counts rc
JOIN users referrer ON rc.referrer_id = referrer.user_id
JOIN users referred ON referred.referred_by_user_id = rc.referrer_id
GROUP BY referrer.name, rc.referral_count;
```

**Key Concepts:** Self-join, HAVING for post-aggregation filter, STRING_AGG for concatenation.

💬 Week 1 Discussion Prompts
Post your answers in the Week 1 Discussion Thread:
- “What’s the most confusing JOIN scenario you’ve encountered?” — Share and help others understand.
- “Subquery vs CTE: When do you choose which?” — There’s no single right answer; discuss trade-offs.
- “Share your Day 6-7 mini-project query” — Get peer feedback on readability and correctness.
⚠️ Common Week 1 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| `WHERE` after `GROUP BY` for aggregates | `WHERE` filters rows before grouping | Use `HAVING` for aggregate conditions |
| Forgetting NULL handling in JOINs | LEFT JOIN creates NULLs; comparisons fail | Use `COALESCE()` or `IS NULL` checks |
| Not aliasing tables in self-joins | Query becomes unreadable, may error | Always alias: `users u1 JOIN users u2` |
| Using `SELECT *` in production | Performance issues; schema changes break code | Explicitly list needed columns |
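To see the first pitfall in action, here's a small DuckDB snippet you can paste into a notebook (the inline `orders` values are made up for illustration):

```python
# WHERE filters rows before grouping; HAVING filters groups after aggregation.
import duckdb

duckdb.sql("""
    WITH orders AS (
        SELECT * FROM (VALUES (1, 50), (1, 30), (2, 200), (2, 10))
               AS v(customer_id, amount)
    )
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    WHERE amount > 20           -- row filter, applied BEFORE grouping
    GROUP BY customer_id
    HAVING SUM(amount) > 100    -- group filter, applied AFTER aggregation
""").show()
# Only customer 2 survives: customer 1 totals 80 after the WHERE filter.
```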
✅ Week 1 Self-Assessment
Before moving to Week 2, can you:
- Write a 3-table JOIN without looking up syntax?
- Explain the difference between WHERE and HAVING in one sentence?
- Convert any subquery to a CTE (and know when not to)?
- Use COALESCE to handle NULLs in your output?
Milestone: 15 SQL problems solved; 1 mini-project query committed to GitHub.
Week 2: SQL Mastery – Window Functions & CTEs
Goal: Become fluent in ROW_NUMBER, RANK, LAG/LEAD, running totals, and cohort queries.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 8 | ROW_NUMBER, RANK, DENSE_RANK | Explain the differences; solve 2 “top-N per group” problems | Mode Window Funcs |
| 9 | LAG / LEAD | Compute day-over-day change; solve 2 problems | StrataScratch |
| 10 | Running Totals / Averages | Use SUM() OVER with frames; solve 2 problems | DataLemur |
| 11 | NTILE, PERCENT_RANK | Bucket users into deciles; explain use case | Mode Tutorial |
| 12 | Cohort & Retention Query | Build a retention matrix query for a sample events table | Handbook: Cohort Analysis |
| 13–14 | Catch-up / Mini-project | Publish a cohort retention query + heatmap visual (Python or Sheets) | Your GitHub |
🆕 Visual SQL (See Your Data Immediately)
Don’t just write queries — visualize the results to build intuition. Window functions are much easier to understand when you can see the data change.
Option 1: DuckDB + Python (Recommended)
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load data and run window function query
con = duckdb.connect()
df = con.execute("""
SELECT
dt,
dau,
AVG(dau) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7d_avg
FROM 'data/daily_metrics.csv'
ORDER BY dt
""").fetchdf()
# Visualize immediately
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(df['dt'], df['dau'], label='Daily DAU', alpha=0.7)
ax.plot(df['dt'], df['trailing_7d_avg'], label='7-Day Trailing Avg', linewidth=2)
ax.legend()
ax.set_title('Daily Active Users with 7-Day Moving Average')
plt.show()
Option 2: Metabase (Free, Local)
- Run `docker run -d -p 3000:3000 metabase/metabase`
- Connect to your local database
- Build charts directly from SQL queries
🤖 AI-Augmented Challenge (Week 2)
SQL Code Review with AI:
- Write a complex window function query (e.g., cohort retention)
- Ask ChatGPT: “Review this SQL query for correctness, performance, and readability. Point out any edge cases I might have missed.”
- Critical step: Test the AI’s suggestions! Does the query still produce correct results?
- Reflection: What did the AI catch that you missed? What did it get wrong?
🧠 Week 2 Challenge Problems
Challenge 1: Top 3 Products per Category (Classic Interview Question)
**Problem:** Given a `products` table with (product_id, category, product_name, revenue), find the top 3 products by revenue in each category. Handle ties by including all tied products.

```sql
-- Sample Data
CREATE TABLE products (
    product_id INT,
    category TEXT,
    product_name TEXT,
    revenue DECIMAL
);

INSERT INTO products VALUES
  (1, 'Electronics', 'Laptop', 1200),
  (2, 'Electronics', 'Phone', 800),
  (3, 'Electronics', 'Tablet', 800), -- Tie with Phone
  (4, 'Electronics', 'Headphones', 150),
  (5, 'Electronics', 'Watch', 300),
  (6, 'Clothing', 'Jacket', 200),
  (7, 'Clothing', 'Shoes', 180);
```

✅ Solution
```sql
WITH ranked AS (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
    FROM products
)
SELECT category, product_name, revenue, rnk
FROM ranked
WHERE rnk <= 3
ORDER BY category, rnk;
```

**Why DENSE_RANK?** If two products tie for 2nd place, the next product is still 3rd (not 4th). Use `RANK()` if you want to skip numbers, `ROW_NUMBER()` if you want exactly 3 rows regardless of ties.

| Function | Tie Handling | Use Case |
|----------|--------------|----------|
| ROW_NUMBER | No ties (arbitrary) | Exactly N rows needed |
| RANK | Skips numbers after tie | Competition rankings |
| DENSE_RANK | No gaps after tie | "Top N including ties" |

Challenge 2: Month-over-Month Growth Rate
**Problem:** Calculate the month-over-month revenue growth percentage. Handle the first month (no prior month) gracefully.

```sql
CREATE TABLE monthly_revenue (month DATE, revenue DECIMAL);
INSERT INTO monthly_revenue VALUES
  ('2025-01-01', 10000),
  ('2025-02-01', 12000),
  ('2025-03-01', 11500),
  ('2025-04-01', 15000);
```

✅ Solution
```sql
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
    ROUND(
        100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
        2
    ) AS mom_growth_pct
FROM monthly_revenue
ORDER BY month;
```

**Key Concepts:**
- `LAG(revenue) OVER (ORDER BY month)` gets the previous row's value
- `NULLIF(..., 0)` prevents division by zero
- First row will show `NULL` for growth (no prior month)

Challenge 3: Running 7-Day Average (Real-World Analytics)
**Problem:** Calculate a 7-day trailing average of daily active users (DAU). This is extremely common in product analytics dashboards.

```sql
CREATE TABLE daily_metrics (dt DATE, dau INT);
INSERT INTO daily_metrics VALUES
  ('2025-12-01', 1000), ('2025-12-02', 1050), ('2025-12-03', 980),
  ('2025-12-04', 1100), ('2025-12-05', 1200), ('2025-12-06', 900),
  ('2025-12-07', 1150), ('2025-12-08', 1300), ('2025-12-09', 1250);
```

✅ Solution
```sql
SELECT
    dt,
    dau,
    ROUND(
        AVG(dau) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW),
        2
    ) AS trailing_7d_avg,
    COUNT(*) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS days_in_window
FROM daily_metrics
ORDER BY dt;
```

**Key Insight:** The `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` creates a 7-day window (current + 6 prior). The `days_in_window` column shows how many days are actually in the calculation — useful for the first 6 days when the window is partial.

**Alternative with RANGE (be careful!):**

```sql
-- RANGE looks at VALUES, not row positions
-- Use ROWS for day-based windows to handle missing dates correctly
```

Challenge 4: Cohort Retention Matrix (Advanced)
**Problem:** Build a retention matrix showing what percentage of users from each signup cohort (month) were active in subsequent months.

```sql
CREATE TABLE user_activity (
    user_id INT,
    signup_date DATE,
    activity_date DATE
);
-- Assume data for users who signed up and logged in on various dates
```

✅ Solution
```sql
WITH cohorts AS (
    SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
    FROM user_activity
    GROUP BY user_id, DATE_TRUNC('month', signup_date)
),
activity AS (
    SELECT user_id, DATE_TRUNC('month', activity_date) AS activity_month
    FROM user_activity
    GROUP BY user_id, DATE_TRUNC('month', activity_date)
),
cohort_activity AS (
    SELECT
        c.cohort_month,
        a.activity_month,
        COUNT(DISTINCT c.user_id) AS active_users,
        -- Months since signup (0 = signup month)
        EXTRACT(YEAR FROM AGE(a.activity_month, c.cohort_month)) * 12
          + EXTRACT(MONTH FROM AGE(a.activity_month, c.cohort_month)) AS month_number
    FROM cohorts c
    JOIN activity a
      ON c.user_id = a.user_id
     AND a.activity_month >= c.cohort_month
    GROUP BY c.cohort_month, a.activity_month
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
    FROM cohorts
    GROUP BY cohort_month
)
SELECT
    ca.cohort_month,
    ca.month_number,
    ca.active_users,
    cs.cohort_size,
    ROUND(100.0 * ca.active_users / cs.cohort_size, 1) AS retention_pct
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
ORDER BY ca.cohort_month, ca.month_number;
```

**Visualization:** Take this output and pivot it in Python/Excel to create a retention heatmap where rows = cohort months, columns = month_number (0, 1, 2...), cells = retention_pct.

💬 Week 2 Discussion Prompts
- “ROWS vs RANGE in window frames—when does it matter?” — Share an example where they produce different results.
- “What’s the trickiest window function problem you’ve solved?” — Post the problem and your approach.
- “Share your cohort retention visualization” — Get feedback on clarity and insights.
⚠️ Common Week 2 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| Forgetting `PARTITION BY` | Window function runs over the entire table | Add `PARTITION BY` for per-group calculations |
| Using RANK when you need exactly N | RANK can return more than N rows on ties | Use ROW_NUMBER for strict limit |
| Window frame includes future rows | With ORDER BY, the default frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which includes tied rows | Explicitly set `ROWS BETWEEN ...` |
| Calculating retention without distinct users | One user with 10 logins ≠ 10 retained users | Always COUNT(DISTINCT user_id) |
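The frame pitfall is easiest to see with tied ORDER BY values, where RANGE includes all peer rows but ROWS does not. A small DuckDB sketch with toy data:

```python
# ROWS vs RANGE: with duplicate ORDER BY keys they give different running sums.
import duckdb

duckdb.sql("""
    SELECT dt, x,
           SUM(x) OVER (ORDER BY dt ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
           SUM(x) OVER (ORDER BY dt RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
    FROM (VALUES (1, 10), (2, 20), (2, 30), (3, 40)) AS t(dt, x)
""").show()
# Both dt = 2 rows get range_sum = 60 (RANGE includes tied "peer" rows);
# rows_sum is 30 then 60, respecting physical row order.
```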
✅ Week 2 Self-Assessment
Before moving to Week 3, can you:
- Explain the difference between RANK, DENSE_RANK, and ROW_NUMBER with an example?
- Calculate a trailing 7-day average using the correct window frame?
- Build a cohort retention query from scratch?
- Use LAG/LEAD to compute period-over-period changes?
Milestone: 10 advanced SQL problems; retention query + visualization.
Week 3: Statistics for Product Analytics
Goal: Build intuition for the stats used every day in experimentation.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 15 | Distributions: Normal, Binomial | Watch StatQuest; sketch 3 product examples | StatQuest: Normal |
| 16 | Central Limit Theorem | Simulate sample means in Python; see CLT in action | Handbook: Statistics |
| 17 | Hypothesis Testing (t-test, z-test) | Calculate by hand; then verify with scipy | Handbook: Hypothesis Testing |
| 18 | p-values & Confidence Intervals | Write a 1-paragraph explanation a PM would understand | Own notes |
| 19 | Power & Sample Size | Use the handbook calculator or write your own formula | Sample Size Calculator |
| 20–21 | Catch-up / Mini-project | Create a Python script or notebook that computes sample size given α, β, baseline, MDE | Your GitHub |
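For the Day 20-21 mini-project, here's a possible starting point built on statsmodels (the helper name and defaults are ours, not a standard API):

```python
# Sample size per variant for a two-proportion test, given a baseline
# conversion rate and a relative minimum detectable effect (MDE).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.8):
    """Users needed per variant to detect a relative lift at given alpha/power."""
    effect = proportion_effectsize(baseline, baseline * (1 + mde_relative))
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, alternative='two-sided')

print(f"{sample_size_per_variant(0.032, 0.10):,.0f} users per variant")  # ~50,000
```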
🆕 The Peeking Simulation (Why Early Stopping is Dangerous)
This visual simulation proves why “peeking” at A/B test results inflates false positive rates. Run it yourself to build intuition.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def simulate_peeking(n_simulations=1000, true_effect=0,
sample_size_per_peek=100, n_peeks=10, alpha=0.05):
"""
Simulate A/B tests with peeking vs waiting for full sample.
true_effect=0 means there's NO real effect (null hypothesis is true).
"""
false_positives_peeking = 0
false_positives_proper = 0
for _ in range(n_simulations):
# Generate full dataset for both groups
control = np.random.normal(0, 1, sample_size_per_peek * n_peeks)
treatment = np.random.normal(true_effect, 1, sample_size_per_peek * n_peeks)
# Peeking: check at each peek point, stop if significant
peeking_significant = False
for peek in range(1, n_peeks + 1):
n = sample_size_per_peek * peek
_, p_value = stats.ttest_ind(control[:n], treatment[:n])
if p_value < alpha:
peeking_significant = True
break # Stop at first significant result
if peeking_significant:
false_positives_peeking += 1
# Proper: wait for full sample
_, p_value_full = stats.ttest_ind(control, treatment)
if p_value_full < alpha:
false_positives_proper += 1
return {
'peeking_fp_rate': false_positives_peeking / n_simulations,
'proper_fp_rate': false_positives_proper / n_simulations
}
# Run simulation
results = simulate_peeking(n_simulations=1000, true_effect=0, n_peeks=10)
print(f"False Positive Rate (Peeking): {results['peeking_fp_rate']:.1%}")
print(f"False Positive Rate (Proper): {results['proper_fp_rate']:.1%}")
print(f"Expected (α=0.05): 5.0%")
Expected Output:
False Positive Rate (Peeking): ~20-30% ← Much higher than expected!
False Positive Rate (Proper): ~5% ← As expected
Expected (α=0.05): 5.0%
The Insight: With 10 peeks, your false positive rate can inflate to 20-30% even though you’re using α=0.05. This is why pre-registration and sequential testing methods matter.
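If you must look early, correct for it. A crude but safe option is Bonferroni-style alpha splitting; reusing `simulate_peeking` from above with a per-look threshold of α/10 shows the false positive rate drop back to roughly 5% or below (real systems use proper sequential methods, but the idea is the same):

```python
# Spend alpha/n_peeks at each look. The union bound caps the overall
# false positive rate at n_peeks * (alpha / n_peeks) = alpha, so this is
# conservative but illustrates why corrected peeking is safe.
corrected = simulate_peeking(n_simulations=1000, true_effect=0,
                             n_peeks=10, alpha=0.05 / 10)
print(f"False Positive Rate (Corrected Peeking): {corrected['peeking_fp_rate']:.1%}")
```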
🤖 AI-Augmented Challenge (Week 3)
Synthetic Data Generation:
- Use Python's `faker` library or an LLM to generate a dataset with specific properties:
  - 10,000 users
  - Conversion rate of 3.5% with a specific distribution
  - Include confounding variables (e.g., users from paid ads convert higher)
- Prompt for LLM: “Generate Python code to create a synthetic A/B test dataset with 10,000 users, a control group with 3% conversion and treatment with 3.5% conversion, including user attributes like signup source and device type that correlate with conversion.”
- Run your hypothesis testing code on the synthetic data
- Reflection: Did you detect the 0.5 percentage point difference? What sample size did you need?
🧠 Week 3 Challenge Problems
Challenge 1: Explain This to a PM (Communication Exercise)
**Scenario:** Your PM asks: "We ran an A/B test and got a p-value of 0.03. The data scientist says it's significant but we should wait for more data. That doesn't make sense—can you explain?"

**Your Task:** Write a 3-4 sentence explanation that:
1. Explains what p=0.03 actually means
2. Explains why waiting might still be the right call
3. Uses no jargon a PM wouldn't understand

✅ Sample Answer

"A p-value of 0.03 means there's about a 3% chance we'd see results this extreme if the feature actually had no effect. That sounds convincing, but here's the catch: if we check our experiment results multiple times before it's 'done,' we inflate our chances of seeing a false positive. It's like flipping a coin 10 times—if you stop the moment you get 3 heads in a row and declare the coin biased, you'll often be wrong. The data scientist likely wants to wait until we hit our pre-planned sample size to avoid this 'peeking' problem. Think of it like not opening the oven door while the cake is baking—checking early can ruin the result."

**Key Concepts:** Peeking/multiple testing problem, pre-registration of sample size.

Challenge 2: Sample Size Calculation (By Hand)
**Problem:** You're planning an A/B test for a checkout flow change. Current conversion rate is 3.2%. You want to detect a 10% relative lift (to 3.52%) with 80% power at α=0.05. **Calculate the required sample size per variant.**

✅ Solution
**Formula for proportions (two-tailed):**

$$n = \frac{2 \cdot (Z_{\alpha/2} + Z_{\beta})^2 \cdot \bar{p}(1-\bar{p})}{\Delta^2}$$

Where:
- $Z_{\alpha/2} = 1.96$ (for α=0.05, two-tailed)
- $Z_{\beta} = 0.84$ (for 80% power)
- $p_1 = 0.032$, $p_2 = 0.0352$
- $\bar{p} = (p_1 + p_2)/2 = 0.0336$
- $\Delta = p_2 - p_1 = 0.0032$

**Calculation:**

$$n = \frac{2 \cdot (1.96 + 0.84)^2 \cdot 0.0336 \cdot 0.9664}{(0.0032)^2} = \frac{2 \cdot 7.84 \cdot 0.0325}{0.00001024} = \frac{0.5096}{0.00001024} \approx 49{,}766$$

**Answer:** ~50,000 users per variant (100,000 total).

**Python verification:**

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.032, 0.0352)
analysis = NormalIndPower()
n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                         alternative='two-sided')
print(f"Sample size per group: {n:.0f}")  # ~49,800
```

Challenge 3: Interpreting Confidence Intervals
**Problem:** An experiment shows:
- Control conversion: 4.2% (95% CI: 3.9% – 4.5%)
- Treatment conversion: 4.6% (95% CI: 4.3% – 4.9%)

The PM says "Treatment is clearly better—the CIs don't overlap!"

**Questions:**
1. Is the PM's logic correct?
2. What's the actual test for significance here?
3. What additional information would you want?

✅ Solution
**1. Is the PM's logic correct?** Partially. Non-overlapping CIs *do* indicate significance, but overlapping CIs don't necessarily mean non-significance! The rule of thumb is unreliable. You need to look at the CI of the *difference*.

**2. The actual test:** Calculate the confidence interval for the difference:
- Point estimate: 4.6% - 4.2% = 0.4 percentage points
- SE of difference: √(SE₁² + SE₂²)
- If the CI of the difference excludes 0, it's significant

**3. Additional information needed:**
- Sample sizes for each group
- The CI for the *lift* or *difference*, not individual proportions
- Whether this was a one-tailed or two-tailed test
- Was the test pre-registered with a specific MDE?

**Key Insight:** Always report the CI for the treatment effect, not just individual group CIs.

Challenge 4: Power Analysis Intuition
**Problem:** Match each scenario to whether it INCREASES or DECREASES required sample size:

| Scenario | Effect on Sample Size |
|----------|----------------------|
| A. Increasing desired power from 80% to 90% | ? |
| B. Decreasing significance level from 0.05 to 0.01 | ? |
| C. Baseline conversion rate moves from 5% to 50% | ? |
| D. MDE changes from 5% relative to 10% relative | ? |
| E. Switching from two-tailed to one-tailed test | ? |

✅ Solutions
| Scenario | Effect | Why |
|----------|--------|-----|
| A. 80% → 90% power | **INCREASES** | Need more data to reduce false negatives |
| B. α from 0.05 → 0.01 | **INCREASES** | Stricter threshold requires more evidence |
| C. Baseline 5% → 50% | **INCREASES** | Higher variance at p=0.5 (p(1-p) maximized) |
| D. MDE 5% → 10% | **DECREASES** | Larger effects are easier to detect |
| E. Two-tailed → one-tailed | **DECREASES** | All alpha is on one side; easier to detect |

**Memory trick:** Anything that makes detection "harder" (higher power, lower alpha, smaller effect, more variance) increases sample size.

💬 Week 3 Discussion Prompts
- “Share your PM-friendly explanation of p-values” — Let others critique for clarity.
- “What’s the biggest statistical mistake you’ve seen in A/B testing?” — Learn from real-world errors.
- “Post your sample size calculator code” — Compare implementations.
⚠️ Common Week 3 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| “p=0.05 means 5% chance result is due to chance” | p-value is P(data \| H₀), not P(H₀ \| data) | p-value = probability of seeing this data if null is true |
| Using 80% power as universal default | May need 90%+ for high-stakes decisions | Match power to business cost of false negatives |
| Ignoring multiple testing | 20 metrics at α=0.05 → expect 1 false positive | Apply Bonferroni or FDR correction |
| Confusing statistical vs practical significance | p=0.001 doesn’t mean the effect matters | Always report effect size with confidence interval |
✅ Week 3 Self-Assessment
Before moving to Week 4, can you:
- Explain p-value, power, and CI to a non-technical stakeholder?
- Calculate sample size by hand for a proportion-based test?
- Identify when confidence intervals are misleading?
- Explain why “peeking” at experiment results inflates false positives?
Milestone: Explain CLT, p-value, power in plain English; sample size script.
Week 4: A/B Testing & Experimentation
Goal: Design, size, and analyze an experiment end-to-end.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 22 | Experiment Design | Define hypothesis, unit, success & guardrail metrics for a feature idea | Airbnb Blog |
| 23 | Randomization & Assignment | Explain salt-based hashing (see the sketch after this table); list common pitfalls | Netflix Tech Blog |
| 24 | Analyze a Mock Result | Given sample data, compute lift, CI, p-value; state conclusion | Handbook: A/B Test Project |
| 25 | Interpreting Edge Cases | What if success metric up but guardrail down? Write decision framework | Own notes |
| 26 | Common Pitfalls (peeking, SRM) | List 5 mistakes and how to avoid them | Trustworthy Online Experiments (book) |
| 27–28 | Catch-up / Mini-project | Write up a full experiment doc: hypothesis → design → analysis → decision | Your GitHub |
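A minimal sketch of the salt-based hashing idea from Day 23 (the function name and salt are illustrative, not a standard API):

```python
# Hash (salt, user_id) to get a deterministic, experiment-specific assignment.
import hashlib

def assign_variant(user_id: str, experiment_salt: str, n_variants: int = 2) -> int:
    """Deterministically assign a user to a variant via salted hashing."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

print(assign_variant("user_42", "checkout_test_v1"))  # same input -> same variant
# A new salt re-shuffles assignments, so experiments don't inherit each
# other's groupings.
```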
🆕 Multi-Armed Bandits (Modern Alternative to A/B Testing)
Traditional A/B tests run to a fixed sample size. Multi-Armed Bandits (MAB) dynamically allocate traffic to better-performing variants, reducing regret (the cost of showing worse variants).
When to Use MAB vs A/B:
| Scenario | Best Approach | Why |
|---|---|---|
| High stakes, need statistical rigor | A/B Test | Fixed sample, clean interpretation |
| Low stakes, want to minimize regret | MAB | Adaptive allocation, faster “winner” |
| Short-lived (< 2 weeks) | MAB | Less time to waste on losers |
| Long-term, need precise effect size | A/B Test | MAB doesn’t give clean CIs |
| Many variants (5+) | MAB | Efficiently prunes losers |
Simple Thompson Sampling Example:
import numpy as np
from scipy import stats
class ThompsonSampling:
"""Simple Thompson Sampling for Bernoulli bandits (conversions)."""
def __init__(self, n_variants):
# Beta(1,1) = uniform prior
self.successes = np.ones(n_variants) # alpha
self.failures = np.ones(n_variants) # beta
def choose_variant(self):
"""Sample from posterior and pick the best."""
samples = [np.random.beta(s, f) for s, f in zip(self.successes, self.failures)]
return np.argmax(samples)
def update(self, variant, converted):
"""Update posterior with new observation."""
if converted:
self.successes[variant] += 1
else:
self.failures[variant] += 1
def get_conversion_rates(self):
"""Get estimated conversion rates (posterior mean)."""
return self.successes / (self.successes + self.failures)
# Simulate a campaign with 3 variants
bandit = ThompsonSampling(n_variants=3)
true_rates = [0.03, 0.035, 0.04] # Variant 2 is best
for _ in range(10000):
chosen = bandit.choose_variant()
converted = np.random.random() < true_rates[chosen]
bandit.update(chosen, converted)
print("Estimated rates:", bandit.get_conversion_rates())
print("True rates: ", true_rates)
Key Insight: Over time, Thompson Sampling naturally allocates more traffic to the best variant while still exploring others.
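You can verify that claim in your own run: each variant starts with a Beta(1,1) prior, so subtracting the two prior pseudo-counts gives the number of times each arm was actually shown.

```python
# Inspect how Thompson Sampling allocated traffic across the 10,000 rounds.
pulls = bandit.successes + bandit.failures - 2  # remove the Beta(1,1) prior counts
print("Traffic share per variant:", pulls / pulls.sum())
# Expect the best variant (index 2) to have received most of the traffic.
```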
🤖 AI-Augmented Challenge (Week 4)
Experiment Design Review with AI:
- Write a complete experiment design document (hypothesis, metrics, sample size, etc.)
- Submit it to ChatGPT with this prompt: “You are a skeptical Staff Data Scientist reviewing this experiment design. Ask me 5 challenging questions about potential flaws, edge cases, or metrics I might have missed.”
- Answer each question thoughtfully
- Reflection: Did the AI raise concerns you hadn’t considered? Were any of its concerns misguided?
🧠 Week 4 Challenge Problems
Challenge 1: Design a Complete Experiment (Meta-Style Interview Question)
**Scenario:** Instagram is considering adding a "Super Like" feature where users can super-like one post per day. This would notify the poster with a special animation.

**Your Task:** Write a complete experiment design document covering:
1. **Hypothesis** (one sentence)
2. **Randomization unit** (user, device, session?) and why
3. **Success metrics** (1-2 primary, 2-3 secondary)
4. **Guardrail metrics** (what could go wrong?)
5. **Sample size estimate** (use reasonable assumptions)
6. **Duration** and what could extend it

✅ Sample Design Document
**1. Hypothesis:** Adding a Super Like feature will increase meaningful engagement (measured by comments and DMs) by 3% without negatively impacting daily active users or time spent.

**2. Randomization Unit:** **User-level** randomization, not session or device:
- Super Like is a persistent feature (once you use it, you can't un-use it)
- Need consistent experience across devices/sessions
- Avoids contamination from shared devices

**3. Success Metrics:**

| Metric | Type | Rationale |
|--------|------|-----------|
| Comments per DAU | Primary | Super Likes should spark conversations |
| DMs sent per DAU | Primary | Notifications may drive private messages |
| Super Likes sent per DAU | Secondary | Feature adoption rate |
| Super Likes received per DAU | Secondary | Distribution of engagement |
| Posts created per DAU | Secondary | Does receiving Super Likes encourage posting? |

**4. Guardrail Metrics:**

| Guardrail | Concern |
|-----------|---------|
| DAU | Feature feels spammy → users leave |
| Time spent per session | Users spend time on Super Like UI → less feed time |
| Regular likes per DAU | Super Like cannibalizes regular engagement |
| Report rate | Super Like used for harassment |
| Unfollow rate | Super Like notifications are annoying |

**5. Sample Size Estimate:**
- Baseline comments/DAU: 0.8
- MDE: 3% relative → 0.024 absolute increase
- Variance: Assume Poisson, σ² ≈ μ → σ = 0.89
- With α=0.05, power=80%: ~50,000 users per variant

**6. Duration:**
- Minimum 2 weeks (full weekly cycle × 2)
- Extend if:
  - Novelty effect suspected (engagement decreasing over time in treatment)
  - Holiday or major event during test period
  - Sample ratio mismatch detected

Challenge 2: Analyze This Experiment Result
**You ran an experiment with these results:**

| Metric | Control (n=50,000) | Treatment (n=50,000) | Lift | 95% CI | p-value |
|--------|-------------------|---------------------|------|--------|---------|
| Conversion Rate | 3.20% | 3.45% | +7.8% | [+2.1%, +13.5%] | 0.008 |
| Revenue per User | $2.10 | $2.08 | -0.9% | [-4.2%, +2.4%] | 0.62 |
| Page Load Time | 1.2s | 1.8s | +50% | [+48%, +52%] | <0.001 |

**Questions:**
1. Should you ship this feature? Why or why not?
2. What follow-up analysis would you do?
3. How would you present this to leadership?

✅ Analysis Framework
**1. Should you ship?** **No, not without further investigation.** Here's the breakdown:

| Metric | Verdict | Reasoning |
|--------|---------|-----------|
| Conversion ✅ | Good | +7.8% is statistically significant and meaningful |
| Revenue ⚠️ | Neutral | No significant change; concerning given conversion increase |
| Page Load ❌ | Bad | 50% slower is a major UX regression and SEO risk |

**The revenue puzzle:** If conversion is up 7.8% but revenue is flat, that means:
- Revenue per converter went *down*
- Lower-quality conversions? Smaller order values?
- Need to segment by customer type

**The page load issue:** This is likely the root cause of the revenue problem:
- Slower pages = frustrated users = lower conversion quality
- Google penalizes slow pages in search rankings
- Must fix before shipping

**2. Follow-up Analysis:**
- Segment conversion by new vs returning users
- Check revenue per *converted* user (not all users)
- Investigate why treatment is slower (A/B test infrastructure issue? Feature itself?)
- Look at bounce rate and time-on-page metrics

**3. Presenting to Leadership:**

*"The test showed promising conversion gains (+7.8%), but we identified a technical issue: the treatment variant loads 50% slower. This likely explains why revenue didn't follow conversion. I recommend we fix the performance issue and re-run, as the underlying feature hypothesis still looks promising."*

**Never present metrics in isolation. Tell a story.**

Challenge 3: Sample Ratio Mismatch (SRM) Detection
**Problem:** Your A/B test was configured for 50/50 split. After 1 week, you see:
- Control: 102,340 users
- Treatment: 97,660 users

**Questions:**
1. Is this an SRM? (Use chi-squared test)
2. What are possible causes?
3. What should you do?

✅ Solution
**1. Chi-Squared Test for SRM:**

```python
from scipy.stats import chisquare

observed = [102340, 97660]
expected = [100000, 100000]  # 50/50 of total 200,000

chi2, p_value = chisquare(observed, expected)
print(f"Chi-squared: {chi2:.2f}, p-value: {p_value:.6f}")
# Chi-squared: 109.5, p-value: 0.000000
```

**Result:** p < 0.001 → **This IS a significant SRM.** The split is not random.

**2. Possible Causes:**

| Cause | How It Happens |
|-------|----------------|
| Randomization bug | Hash function not evenly distributed |
| Bot filtering | Bots disproportionately in one variant |
| Browser/device exclusion | Treatment feature doesn't work on Safari → those users excluded |
| Redirect-based assignment | Slower treatment page → more users bounce before logging |
| Experiment start time | Gradual rollout started at different times |

**3. What to Do:**
1. **STOP the experiment** — results are untrustworthy
2. Investigate the assignment mechanism
3. Check if Treatment has any technical issues (crashes, redirects)
4. Look at pre-experiment data to see if groups were already different
5. Fix the issue and restart with fresh users

**Rule of Thumb:** If SRM is detected, no amount of "adjusting for it" makes the results valid. The experiment is compromised.

Challenge 4: The Guardrail Paradox
**Scenario:** Your experiment shows:
- Primary metric (purchases): +5% (p=0.02) ✅
- Guardrail (customer service tickets): +15% (p=0.01) ❌

Leadership wants to ship because revenue is more important than support costs.

**What's your recommendation and how do you communicate it?**

✅ Framework
**Quantify the trade-off:**
1. Calculate incremental revenue from +5% purchases
2. Calculate incremental cost from +15% support tickets
3. Don't forget: unhappy customers churn, leave bad reviews, reduce LTV

**Example Analysis:**

```
Incremental purchases: 5% × 1M purchases × $50 avg = $2.5M
Incremental tickets:   15% × 50K tickets × $20 cost each = $150K
Net gain: $2.35M
```

BUT also consider:
- Why are tickets up? Product confusing? Quality issue?
- Will ticket rate stay at +15% or grow as frustrated users accumulate?
- What's the customer sentiment in those tickets?

**Communication Framework:**

*"I recommend we investigate before shipping. The +15% in support tickets isn't just a cost issue—it's a signal that something in the feature is confusing or broken. If we ship without understanding why, we risk:*
1. *Churn from frustrated customers*
2. *Negative reviews affecting acquisition*
3. *Scaling support costs as we grow*

*Can we run a qualitative analysis on the ticket text to identify the root cause? If it's a fixable UX issue, we could ship after a quick iteration and re-test."*

**Key Insight:** Guardrails exist for a reason. Don't ignore them just because the primary metric looks good.

💬 Week 4 Discussion Prompts
- “Share your experiment design doc” — Get peer feedback on metric choices and potential blind spots.
- “What’s the hardest experiment decision you’ve had to make?” — Trade-off stories teach the most.
- “Post a real or hypothetical guardrail dilemma” — Practice the decision framework.
⚠️ Common Week 4 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| Calling the experiment early when p < 0.05 | Peeking inflates false positives | Pre-commit to sample size; use sequential testing if needed |
| Ignoring SRM | Results are meaningless with biased assignment | Always check SRM before analyzing results |
| Too many success metrics | Increases false positive rate | Pick 1-2 primary metrics; adjust for multiple testing |
| Randomizing at session level for persistent features | User sees different variants → confusion and bias | Match randomization unit to feature persistence |
✅ Week 4 Self-Assessment
Before moving to Week 5, can you:
- Write a complete experiment design document from scratch?
- Calculate and interpret lift, confidence interval, and p-value?
- Detect Sample Ratio Mismatch using chi-squared test?
- Navigate the trade-off when success metric is up but guardrail is down?
🎯 HALFWAY POINT! You’ve earned your Halfway Hero badge!
Milestone: Complete experiment write-up (design + analysis + recommendation).
Week 5: Python (Pandas & Polars) for Analysis
Goal: Confidently clean, explore, and visualize a messy dataset. New for 2026: Learn Polars as a high-performance alternative.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 29 | DataFrame Basics | Load CSV, inspect with info(), head(), describe() | Handbook: Python Analysis |
| 30 | Data Cleaning | Handle nulls, duplicates, dtypes; document your decisions | Kaggle: Data Cleaning |
| 31 | Groupby & Aggregation | Compute metrics by segment; compare to SQL approach | Pandas Docs |
| 32 | Merging & Joining | Merge 2 tables; assert key uniqueness before join | Handbook: Python Patterns |
| 33 | Visualization (Matplotlib/Seaborn) | Create bar, line, histogram, heatmap for your dataset | Seaborn Gallery |
| 34–35 | Catch-up / Mini-project | Publish an EDA notebook with insights and 3+ charts | Your GitHub |
🆕 Polars: The High-Performance Alternative (2026 Must-Know)
Polars is gaining massive traction as a faster, more memory-efficient alternative to Pandas. It uses Rust under the hood and supports lazy evaluation — essential for large datasets.
Why Learn Polars in 2026?
- 10-100x faster than Pandas for many operations
- Lazy execution — builds an optimized query plan before running
- Parallel by default — uses all CPU cores automatically
- Growing industry adoption — increasingly mentioned in job postings
Pandas vs Polars Comparison:
# ----- PANDAS -----
import pandas as pd
df_pd = pd.read_csv('data/orders.csv')
result_pd = (
df_pd[df_pd['amount'] > 50]
.groupby('customer_id')['amount']
.sum()
.reset_index()
.sort_values('amount', ascending=False)
)
# ----- POLARS -----
import polars as pl
df_pl = pl.read_csv('data/orders.csv')
result_pl = (
df_pl.lazy() # Enable lazy evaluation
.filter(pl.col('amount') > 50)
.group_by('customer_id')
.agg(pl.col('amount').sum())
.sort('amount', descending=True)
.collect() # Execute the query plan
)
Key Differences:
| Feature | Pandas | Polars |
|---|---|---|
| Execution | Eager (immediate) | Lazy (optimized plan) |
| Parallelism | Single-threaded | Multi-threaded by default |
| Memory | Often copies data | Zero-copy when possible |
| Syntax | .loc, .iloc, chaining | Method chaining, expressions |
| Null handling | NaN (float) | Native null type |
Week 5 Polars Challenge: Re-implement your Day 30-31 Pandas code in Polars. Compare performance on a larger dataset.
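If you want numbers for that comparison, a rough timing harness might look like this (results vary by machine and file size; it assumes `data/orders.csv` from the repo):

```python
# Time the same filter-free aggregation in Pandas and Polars.
import time
import pandas as pd
import polars as pl

t0 = time.perf_counter()
pd.read_csv("data/orders.csv").groupby("customer_id")["amount"].sum()
pandas_s = time.perf_counter() - t0

t0 = time.perf_counter()
(pl.scan_csv("data/orders.csv")        # lazy: the plan is optimized before running
   .group_by("customer_id")
   .agg(pl.col("amount").sum())
   .collect())
polars_s = time.perf_counter() - t0

print(f"pandas: {pandas_s:.4f}s  polars: {polars_s:.4f}s")
```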
🤖 AI-Augmented Challenge (Week 5)
Refactoring with AI:
- Write your EDA code in a “quick and dirty” way (it’s okay to have spaghetti code!)
- Once it works, ask an AI: “Refactor this Pandas code into clean, modular functions with proper docstrings. Use best practices for data analysis code.”
- Critical step: Review and test the refactored code. Does it produce the same results?
- Reflection: What improvements did the AI make? What did you disagree with?
🧠 Week 5 Challenge Problems
Challenge 1: The Messy Dataset Gauntlet
**Problem:** You receive this CSV with user transactions. Clean it and answer the business questions.

```python
import pandas as pd
from io import StringIO

data = """user_id,transaction_date,amount,category,country
1,2025-01-15,50.00,Electronics,USA
1,2025-01-15,50.00,Electronics,USA
2,01/20/2025,$75.50,Clothing,usa
3,2025-01-22,,-,Canada
4,2025-01-25,100.00,Electronics,UK
5,Feb 1 2025,25.00,Food,USA
6,2025-02-10,999999.99,Electronics,USA
1,2025-02-15,30.00,Food,United States
"""
df = pd.read_csv(StringIO(data))
```

**Issues to fix:**
1. Duplicate row
2. Inconsistent date formats
3. Amount has $ and is string
4. Missing values (amount, category)
5. Country inconsistency (USA vs usa vs United States)
6. Potential outlier ($999,999)

**Tasks:**
1. Clean the data
2. What's the total revenue by category (excluding the outlier)?
3. What's the revenue by country (after standardizing)?

✅ Solution
```python
import pandas as pd
import numpy as np

# 1. Remove duplicates
df = df.drop_duplicates()

# 2. Parse dates with multiple formats
def parse_date(date_str):
    for fmt in ['%Y-%m-%d', '%m/%d/%Y', '%b %d %Y']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    return pd.NaT

df['transaction_date'] = df['transaction_date'].apply(parse_date)

# 3. Clean amount: remove $, convert to float
df['amount'] = df['amount'].replace(r'[\$,]', '', regex=True)
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# 4. Handle missing values
#    - Drop rows with no amount (can't calculate revenue)
#    - Fill missing category with 'Unknown'
df = df.dropna(subset=['amount'])
df['category'] = df['category'].replace('-', np.nan).fillna('Unknown')

# 5. Standardize country
country_map = {
    'usa': 'USA',
    'united states': 'USA',
    'uk': 'UK',
    'canada': 'Canada'
}
df['country'] = df['country'].str.lower().map(country_map).fillna(df['country'])

# 6. Handle outlier (IQR method or business rule)
#    Business rule: transactions over $10,000 are likely errors
df_clean = df[df['amount'] <= 10000]

# Answer questions
print("Revenue by Category:")
print(df_clean.groupby('category')['amount'].sum().sort_values(ascending=False))
print("\nRevenue by Country:")
print(df_clean.groupby('country')['amount'].sum().sort_values(ascending=False))
```

**Output:**

```
Revenue by Category:
Electronics    150.0
Clothing        75.5
Food            55.0

Revenue by Country:
USA    180.5
UK     100.0
```

(The Canada row is dropped entirely for its missing amount, so the 'Unknown' category never appears in the output.)

**Key Principle:** Document every cleaning decision. Future you (or your reviewer) needs to understand why.

Challenge 2: Groupby Gymnastics (Interview-Level)
**Problem:** Given user session data, calculate for each user:
1. Total sessions
2. Average session duration
3. Days since first session
4. Most common device type
5. Conversion rate (sessions with purchase / total sessions)

```python
sessions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'session_date': pd.to_datetime(['2025-01-01', '2025-01-05', '2025-01-10',
                                    '2025-01-02', '2025-01-08', '2025-01-03']),
    'duration_seconds': [300, 450, 200, 600, 300, 150],
    'device': ['mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'mobile'],
    'purchased': [False, True, False, True, True, False]
})
```

✅ Solution
```python
today = pd.to_datetime('2025-01-15')

def most_common(series):
    """Return most common value, handling ties."""
    return series.mode().iloc[0] if len(series.mode()) > 0 else None

user_summary = sessions.groupby('user_id').agg(
    total_sessions=('session_date', 'count'),
    avg_duration_sec=('duration_seconds', 'mean'),
    first_session=('session_date', 'min'),
    most_common_device=('device', most_common),
    sessions_with_purchase=('purchased', 'sum'),
    total_sessions_for_cvr=('purchased', 'count')
).reset_index()

# Calculate derived fields
user_summary['days_since_first'] = (today - user_summary['first_session']).dt.days
user_summary['conversion_rate'] = (
    user_summary['sessions_with_purchase'] / user_summary['total_sessions_for_cvr']
)

# Clean up
user_summary = user_summary.drop(columns=['first_session', 'total_sessions_for_cvr'])
print(user_summary.to_string())
```

**Output:**

```
   user_id  total_sessions  avg_duration_sec most_common_device  sessions_with_purchase  days_since_first  conversion_rate
0        1               3        316.666667             mobile                       1                14         0.333333
1        2               2        450.000000            desktop                       2                13         1.000000
2        3               1        150.000000             mobile                       0                12         0.000000
```

**Pandas Pro Tips:**
- Use `.agg()` with named aggregations for clarity
- Create custom aggregation functions for complex logic
- Calculate ratios after the groupby, not inside it

Challenge 3: Merge Madness (Fanout Detection)
**Problem:** You're merging users and orders. Predict what happens and how to prevent issues.

```python
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'user_id': [1, 1, 2, 2, 4],  # Note: user 4 doesn't exist in users
    'amount': [50, 30, 75, 25, 100]
})
```

**Questions:**
1. What's the row count after `users.merge(orders)`? (default inner join)
2. What's the row count after `users.merge(orders, how='left')`?
3. How do you calculate revenue per user without row fanout affecting your calculation?

✅ Solution
```python
# Question 1: Inner join
inner = users.merge(orders)
print(f"Inner join rows: {len(inner)}")  # 4 rows (user 1's 2 orders + user 2's 2 orders)

# Question 2: Left join
left = users.merge(orders, how='left')
print(f"Left join rows: {len(left)}")  # 5 rows (includes Carol with NaN order)

# Question 3: Calculate revenue per user WITHOUT fanout
# WRONG: merge then groupby (works but wastes memory on large data)
# RIGHT: aggregate first, then merge

# Method 1: Aggregate orders first
user_revenue = orders.groupby('user_id')['amount'].sum().reset_index()
user_revenue.columns = ['user_id', 'total_revenue']
result = users.merge(user_revenue, how='left')
result['total_revenue'] = result['total_revenue'].fillna(0)
print(result)

# Method 2: Use .map() for simple lookups
revenue_lookup = orders.groupby('user_id')['amount'].sum()
users['total_revenue'] = users['user_id'].map(revenue_lookup).fillna(0)
```

**Fanout Prevention Rule:**

> Always aggregate the "many" side before merging to the "one" side.

**Verification Pattern:**

```python
# Before merge
print(f"Users: {len(users)}, Orders: {len(orders)}")

# After merge
merged = users.merge(orders)
print(f"Merged: {len(merged)}")

# Assert no unexpected fanout: for an inner join on user_id, expect one row
# per order whose user actually exists in users (order 105 has no match here)
expected_rows = orders['user_id'].isin(users['user_id']).sum()
assert len(merged) == expected_rows, f"Fanout detected: {len(merged)} vs expected {expected_rows}"
```

Challenge 4: Visualization Best Practices
**Problem:** Critique these common visualization mistakes and fix them.

**Mistake 1: Pie chart for 10 categories**

```python
# Bad
df.groupby('category')['sales'].sum().plot.pie()
```

**Mistake 2: Dual Y-axes with different scales**

```python
# Misleading
fig, ax1 = plt.subplots()
ax1.plot(df['date'], df['revenue'])  # Scale: 0-10M
ax2 = ax1.twinx()
ax2.plot(df['date'], df['users'])    # Scale: 0-1000
```

**Mistake 3: No context for bar chart**

```python
# What's good? What's bad?
df.plot.bar(x='region', y='conversion_rate')
```

✅ Better Approaches
**Fix 1: Use horizontal bar chart for many categories**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Sort and limit to top categories
top_cats = df.groupby('category')['sales'].sum().nlargest(8)
top_cats.sort_values().plot.barh(color='steelblue')
plt.xlabel('Sales ($)')
plt.title('Top 8 Categories by Sales')
plt.tight_layout()
```

**Fix 2: Index both series or use separate charts**

```python
# Option A: Index to same baseline
fig, ax = plt.subplots()
df['revenue_indexed'] = df['revenue'] / df['revenue'].iloc[0] * 100
df['users_indexed'] = df['users'] / df['users'].iloc[0] * 100
ax.plot(df['date'], df['revenue_indexed'], label='Revenue (indexed)')
ax.plot(df['date'], df['users_indexed'], label='Users (indexed)')
ax.set_ylabel('Index (Day 1 = 100)')
ax.legend()

# Option B: Small multiples (two separate charts)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
ax1.plot(df['date'], df['revenue'])
ax1.set_ylabel('Revenue ($)')
ax2.plot(df['date'], df['users'])
ax2.set_ylabel('Users')
```

**Fix 3: Add benchmark/context**

```python
fig, ax = plt.subplots(figsize=(10, 6))

# Add benchmark line
benchmark = 0.05  # 5% company average
bars = ax.bar(df['region'], df['conversion_rate'], color='steelblue')
ax.axhline(y=benchmark, color='red', linestyle='--',
           label=f'Company Avg ({benchmark:.0%})')

# Color bars by performance
for bar, val in zip(bars, df['conversion_rate']):
    if val < benchmark:
        bar.set_color('coral')

ax.set_ylabel('Conversion Rate')
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}'))
ax.legend()
plt.title('Conversion Rate by Region vs. Company Average')
```

💬 Week 5 Discussion Prompts
- “Share your messiest data cleaning challenge” — What issues did you find and how did you solve them?
- “Post your EDA notebook” — Get feedback on structure, visualization choices, and insights.
- “Pandas vs SQL: which do you prefer for [task]?” — Compare approaches for groupby, joins, window functions.
⚠️ Common Week 5 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| `df['col'] = df['col'].fillna(0)` on a slice | SettingWithCopyWarning; may not persist | Use `df.loc[:, 'col'] = ...` or `.copy()` first |
| Merging without checking key uniqueness | Silent fanout multiplies your rows | `assert df['key'].is_unique` before merge |
| Using `apply()` for vectorizable operations | 10-100x slower than vectorized | Use built-in pandas/numpy operations |
| Not resetting index after groupby | Index becomes the grouped column | Chain `.reset_index()` for a clean DataFrame |
✅ Week 5 Self-Assessment
Before moving to Week 6, can you:
- Load a CSV and immediately identify its data quality issues?
- Write a groupby with multiple aggregations including custom functions?
- Merge two tables and verify no unexpected row fanout occurred?
- Create a publication-quality chart with title, labels, and context?
Milestone: EDA notebook with documented cleaning + visualizations.
Week 6: Product Metrics & Case Studies
Goal: Think like a product analyst; answer “why” and “what next.”
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 36 | Metrics Frameworks (HEART, AARRR) | Apply one framework to a product you use daily | HEART Framework |
| 37 | North Star & Input Metrics | Define for a hypothetical feature launch | Own notes |
| 38 | Case Study Walk-Through | Crack one open-ended case: “DAU dropped 5%—investigate” | Handbook: Analytical Execution |
| 39 | Case Study Walk-Through | Crack another: “Should we launch X feature internationally?” | Exponent Course |
| 40 | Case Study Walk-Through | Crack a metrics definition case | StrataScratch |
| 41–42 | Catch-up / Mini-project | Write a 1-page case study answer doc with structure | Your GitHub |
🆕 Metric Trees (Hierarchical Metric Visualization)
Metric Trees help you visualize how metrics relate to each other hierarchically. This is essential for:
- Understanding which input metrics drive your North Star
- Diagnosing drops (which branch of the tree is affected?)
- Communicating metric strategy to stakeholders
Example: E-Commerce Revenue Metric Tree
```
                 ┌─────────────────────────────────────┐
                 │        NORTH STAR: Revenue          │
                 │          = Orders × AOV             │
                 └─────────────────────────────────────┘
                                    │
          ┌─────────────────────────┴─────────────────────────┐
          ▼                                                   ▼
┌───────────────────┐                               ┌───────────────────┐
│      Orders       │                               │ Average Order Val │
│ = Visitors × CVR  │                               │ = Revenue/Orders  │
└───────────────────┘                               └───────────────────┘
          │                                                   │
  ┌───────┴───────┐                                   ┌───────┴───────┐
  ▼               ▼                                   ▼               ▼
┌─────────┐ ┌─────────┐                         ┌─────────┐ ┌─────────┐
│Visitors │ │   CVR   │                         │Units per│ │  Unit   │
│         │ │         │                         │  Order  │ │  Price  │
└─────────┘ └─────────┘                         └─────────┘ └─────────┘
     │           │
     ▼           ▼
┌─────────────┐ ┌─────────────┐
│Organic      │ │Add-to-Cart  │
│Paid         │ │Cart→Checkout│
│Direct       │ │Checkout→Pay │
└─────────────┘ └─────────────┘
```
How to Use Metric Trees:
- Start at the top with your North Star (Revenue, DAU, etc.)
- Decompose mathematically — use multiplications and additions (see the sketch after this list)
- Go 3-4 levels deep until you reach actionable levers
- Color-code for diagnosis — red for declining, green for improving
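Because the tree above is multiplicative, log-changes in the inputs add up exactly to the log-change in the North Star, which gives a clean way to attribute a move. A minimal sketch, with invented numbers:

```python
# Minimal sketch: attribute a revenue change to metric-tree inputs.
# Assumes Revenue = Visitors × CVR × AOV; all figures are invented.
import math

last_week = {"visitors": 100_000, "cvr": 0.032, "aov": 45.00}
this_week = {"visitors": 96_000,  "cvr": 0.031, "aov": 46.00}

def revenue(m):
    return m["visitors"] * m["cvr"] * m["aov"]

total = math.log(revenue(this_week) / revenue(last_week))

# Each input's log-change is its exact additive contribution to the total
for metric in last_week:
    contribution = math.log(this_week[metric] / last_week[metric])
    print(f"{metric}: {contribution:+.2%} log-change "
          f"({contribution / total:.0%} of the total move)")
```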
Tools for Creating Metric Trees:
- Miro — drag-and-drop diagrams
- Excalidraw — hand-drawn style, great for presentations
- Whimsical — clean flowcharts
🤖 AI-Augmented Challenge (Week 6): Simulated Stakeholder Bot
Practice Case Study Defense:
- Write your case study answer for “DAU dropped 5%”
- Submit it to ChatGPT with this prompt:
“You are a skeptical VP of Product. I’m presenting my analysis of why DAU dropped 5%. Challenge my findings with tough questions like: ‘But how do you know it’s not just seasonality?’ or ‘What’s the confidence level on that conclusion?’ Be specific and push back on weak logic.”
- Defend your analysis in a back-and-forth conversation
- Reflection: What gaps in your analysis did the “VP” expose?
🧠 Week 6 Challenge Problems
Challenge 1: DAU Dropped 5% — Full Investigation Framework
**Scenario:** You're the data scientist at a social media app. The CEO pings you at 9 AM: *"DAU is down 5% compared to last week. What's happening?"*

**Your Task:** Walk through a complete investigation framework.

✅ Complete Framework
### Step 1: Clarify & Don't Panic (2 minutes)

**Questions to ask:**
- Is this 5% week-over-week or compared to the same day last week?
- Is this statistically significant given normal variance?
- When did it start? (Today? Gradual over the week?)

**Quick sanity checks:**
- Is it a holiday or unusual day?
- Any major external events? (competitor launch, news)
- Were there any product deployments?

### Step 2: Decomposition Framework (MECE Breakdown)

**Split DAU into components:**

```
DAU = New Users + Returning Users
Or: DAU = Σ (Users by Platform) = iOS + Android + Web
Or: DAU = Σ (Users by Country) = US + UK + India + ...
Or: DAU = Σ (Users by Acquisition Source) = Organic + Paid + Referral
```

**Check each slice:**

```sql
SELECT
    activity_date,
    user_type,  -- 'new' vs 'returning'
    COUNT(DISTINCT user_id) AS dau
FROM daily_activity
WHERE activity_date BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY) AND CURRENT_DATE
GROUP BY 1, 2
ORDER BY 1, 2;
```

### Step 3: Hypothesis Tree

| Hypothesis | How to Check | What Would Confirm It |
|------------|--------------|----------------------|
| **Supply-side: Less content** | Posts created per day | Down more than DAU |
| **Demand-side: Less engagement** | Sessions per user | Flat or up (not the issue) |
| **Technical: App issues** | Crash rate, error logs | Spike correlates with DAU drop |
| **External: Competition** | App store rankings, competitor news | Competitor launched/promoted |
| **Seasonal: Natural variance** | Same period last year | Similar pattern historically |
| **Acquisition: Fewer new users** | New signups, ad spend | Marketing budget cut or CAC spike |
| **Retention: Users leaving** | D1, D7 retention rates | Drop in recent cohorts |

### Step 4: Drill Down (Example Findings)

**Finding:** The drop is entirely in Android, specifically in India.

**Next questions:**
- Was there an Android-only deployment?
- Did Google Play feature placement change?
- Any carrier/ISP issues in India?
- Was there an Android update that broke something?

### Step 5: Structure Your Answer

**For the CEO (30-second version):**

*"DAU is down 5%, driven entirely by Android users in India. We deployed an update Tuesday that increased app size by 40MB, which is causing download/update failures on lower-end devices. Engineering is rolling back now; we expect recovery in 24-48 hours."*

**For a detailed writeup:**
1. Summary of impact (quantify)
2. Root cause with supporting data
3. Immediate actions taken
4. Longer-term prevention steps
5. How we'll monitor recovery

Challenge 2: Define Metrics for a New Feature (Product Sense)
**Scenario:** Instagram is launching "Close Friends Story Reactions" — when you react to a close friend's story, they see a special animation. You're asked to define success metrics.

**Your Task:** Define a complete metrics framework.

✅ Complete Metrics Framework
### North Star Metric

**Close Friends Interactions per User per Week**

Why: This measures the core behavior we're trying to drive — deeper connections with close friends.

### Primary Metrics (Success Criteria)

| Metric | Definition | Target | Rationale |
|--------|------------|--------|-----------|
| Close Friends Reaction Rate | (Reactions to CF stories) / (Views of CF stories) | +10% vs baseline | Direct feature usage |
| CF Story Replies | DMs sent in response to CF stories | +5% | Reactions should spark conversation |

### Secondary Metrics (Mechanism Validation)

| Metric | What It Tells Us |
|--------|------------------|
| CF Story Views per User | Are people watching more CF content? |
| Time to First Reaction | Is the new animation discoverable? |
| Reaction Type Distribution | Which animations are most popular? |
| CF List Size | Are users adding more close friends? |

### Guardrail Metrics (Watch for Harm)

| Guardrail | Concern | Threshold |
|-----------|---------|-----------|
| Total Story Views | CF reactions cannibalize regular stories | No decrease |
| DAU | Feature is annoying/spammy | No decrease |
| CF Removal Rate | Users remove friends who over-react | No increase |
| Story Creation Rate | Creators feel pressured by reactions | No decrease |

### Segmentation Plan

Analyze all metrics by:
- New vs returning users
- Heavy vs light CF users
- iOS vs Android
- Market (US, India, Brazil, etc.)

### Counter-metrics to Watch

**If Close Friends reactions go UP but these go DOWN, we may have a problem:**
- Regular story reactions (cannibalization)
- DM conversations (reactions replace actual conversation)
- Close Friends story creation (creators feel pressure)

Challenge 3: Launch Decision Framework
**Scenario:** You ran a test for a new checkout flow. Results:

| Metric | Control | Treatment | Lift | p-value |
|--------|---------|-----------|------|---------|
| Conversion Rate | 3.2% | 3.4% | +6.2% | 0.03 |
| Revenue per Visitor | $4.50 | $4.45 | -1.1% | 0.42 |
| Customer Service Tickets | 0.8% | 1.2% | +50% | 0.001 |
| Page Load Time | 2.1s | 2.3s | +10% | 0.02 |

**Question:** Should we launch? Present your recommendation.

✅ Structured Recommendation
### Executive Summary: **Do Not Launch (Yet)**

### Analysis

**The Good:**
- Conversion rate increased 6.2% (statistically significant)
- This would translate to ~$X additional revenue per month

**The Concerning:**
- Revenue per visitor is flat/slightly down (not significant, but directionally negative)
- **Customer service tickets up 50%** — this is a major red flag
- Page load time increased 10% — impacts SEO and user experience

### Diagnosis

The conversion increase paired with flat revenue and higher support tickets suggests:
- More users are completing checkout, but...
- They may be confused about what they purchased, OR
- They're encountering issues post-purchase, OR
- The new flow attracts lower-value conversions

### Recommendation

1. **Investigate support ticket content** — What are users complaining about?
2. **Analyze conversion quality** — Are these incremental converters or just shifted from returning later?
3. **Fix page load regression** — This is likely addressable
4. **Consider partial rollout** — Launch to power users first, gather qualitative feedback

### Decision Framework

| Scenario | Action |
|----------|--------|
| Tickets are about one fixable issue | Fix it, re-run for 1 week |
| Tickets show fundamental confusion | Redesign UX, full re-test |
| Tickets are about external factors | Address, then proceed with launch |
| Revenue per visitor drops significantly | Kill the test |

### Communication

*"The test showed promising conversion gains, but support ticket volume increased 50%. Before launching, I recommend we analyze ticket content to understand the root cause. If it's a fixable UX issue, we can iterate quickly and capture the conversion gains without degrading support costs."*

Challenge 4: "Metric Went Up — Is It Good?" (Nuance Exercise)
**For each scenario, is the metric increase good, bad, or "it depends"?**

| Metric | Change | Good/Bad/Depends? |
|--------|--------|-------------------|
| Time spent in app | +15% | ? |
| Customer service ticket resolution time | +20% | ? |
| Number of A/B tests run | +50% | ? |
| DAU / MAU ratio | +10% | ? |
| Signup conversion rate | +25% | ? |

✅ Answers with Reasoning
| Metric | Change | Verdict | Reasoning |
|--------|--------|---------|-----------|
| Time spent in app | +15% | **Depends** | Good if users are engaged. Bad if they're confused/stuck. Check if it correlates with satisfaction/NPS. |
| CS ticket resolution time | +20% | **Bad** | Longer resolution = worse customer experience. Unless tickets are more complex (need more data). |
| Number of A/B tests run | +50% | **Depends** | Good if test quality is maintained. Bad if it's "test more" without learning more. |
| DAU/MAU ratio | +10% | **Good** | Higher stickiness — users engage more frequently. Classic engagement metric. |
| Signup conversion | +25% | **Depends** | Good if quality is maintained. Check D1, D7 retention of new users. May have lowered friction too much (low-quality signups). |

**Meta-lesson:** Almost every metric is "it depends." Good analysts ask: "What behavior does this metric incentivize? What would game it?"

💬 Week 6 Discussion Prompts
- “Share your ‘DAU dropped’ investigation” — Post your decomposition approach.
- “What’s the most misleading metric you’ve encountered?” — Discuss Goodhart’s Law in practice.
- “Present your case study write-up” — Get feedback on structure and communication.
⚠️ Common Week 6 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| Jumping to solutions before diagnosing | You might solve the wrong problem | Always decompose first (MECE) |
| Only looking at one metric | Miss cannibalization, side effects | Always check guardrails and counter-metrics |
| Assuming more = better | Time spent ≠ value; clicks ≠ satisfaction | Understand the behavior behind the metric |
| Not quantifying impact | “It’s down” vs “It’s down $50K/day” | Always translate to business impact |
✅ Week 6 Self-Assessment
Before moving to Week 7, can you:
- Decompose a metric drop into MECE components?
- Define a North Star metric with supporting primary, secondary, and guardrail metrics?
- Structure a go/no-go launch recommendation with data?
- Identify when a metric increase might actually be bad?
Milestone: 3 case studies solved; 1 written doc.
Week 7: Analytical Engineering Basics
Goal: Understand how analysts and AEs model data for reliability. New for 2026: Get hands-on with dbt Core.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 43 | Raw → Staging → Mart Layers | Read the AE index page; sketch layers for an orders pipeline | Handbook: Analytical Engineering |
| 44 | De-duplication Patterns | Write a ROW_NUMBER dedupe query; explain tie-breakers | Handbook: Advanced SQL |
| 45 | Idempotent Loads (ON CONFLICT) | Write an upsert in Postgres | Handbook: Advanced SQL |
| 46 | Data Quality Checks | Write 3 test queries (uniqueness, not-null, referential) | Handbook: AE Checks |
| 47 | Date Spines & Completeness | Generate a date spine; LEFT JOIN to fill gaps | Handbook: Date Spine |
| 48–49 | Catch-up / Mini-project | Commit a small pipeline (staging view + mart table + test queries) | Your GitHub |
🆕 dbt Core Tutorial (Industry-Standard Transformation Tool)
dbt (data build tool) is the industry standard for transforming data in the warehouse. Even if you don’t become an Analytics Engineer, understanding dbt is valuable for:
- Reading and contributing to data models at your company
- Understanding how modern data teams operate
- Speaking the same language as your AE counterparts
Quick Start with dbt + DuckDB (No Cloud Required):
```bash
# Install dbt with the DuckDB adapter
pip install dbt-core dbt-duckdb

# Initialize a new dbt project
dbt init my_analytics_project
cd my_analytics_project
```
Project Structure:
```
my_analytics_project/
├── models/
│   ├── staging/              # Clean, rename, dedupe raw data
│   │   ├── stg_orders.sql
│   │   ├── stg_products.sql
│   │   └── _staging.yml      # Schema tests
│   ├── intermediate/         # Business logic, joins
│   │   └── int_order_items_enriched.sql
│   └── marts/                # Final aggregations
│       └── mart_daily_revenue.sql
├── tests/                    # Custom data tests
├── macros/                   # Reusable SQL snippets
├── dbt_project.yml           # Project configuration
└── profiles.yml              # Database connections
```
Example Staging Model (models/staging/stg_orders.sql):
```sql
-- stg_orders.sql
-- This model deduplicates raw orders and standardizes column types
with source as (
    -- Assumes a source named 'raw' with a raw_orders table
    -- defined in your sources .yml
    select * from {{ source('raw', 'raw_orders') }}
),

deduplicated as (
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as row_num
    from source
)

select
    order_id,
    customer_id,
    order_date::date as order_date,
    amount::numeric(12,2) as amount,
    status
from deduplicated
where row_num = 1
  and status != 'cancelled'
```
Add Tests (models/staging/_staging.yml):
```yaml
version: 2

models:
  - name: stg_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
```
Run dbt:
```bash
# Build all models
dbt run

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve   # Opens in browser at localhost:8080
```
🤖 AI-Augmented Challenge (Week 7)
dbt Model Generation with AI:
- Describe your data source to ChatGPT: “I have a raw_orders table with columns (order_id, customer_id, order_date, amount, status, _loaded_at). Generate a dbt staging model that deduplicates by order_id, filters cancelled orders, and casts types appropriately.”
- Review the generated SQL — does it follow dbt best practices?
- Add tests and run `dbt test` to validate
- Reflection: What did the AI get right? What did you need to customize?
🧠 Week 7 Challenge Problems
Challenge 1: Design a Data Pipeline (Whiteboard Exercise)
**Scenario:** You have three source tables from your e-commerce platform:
- `raw_orders` — every order placed (may have duplicates from retry logic)
- `raw_order_items` — line items within orders
- `raw_products` — product catalog

**Design a pipeline to create a `mart_daily_revenue` table showing daily revenue by product category.**

✅ Complete Pipeline Design
### Layer Architecture

```
RAW (source of truth, append-only)
├── raw_orders
├── raw_order_items
└── raw_products
        ↓
STAGING (cleaned, deduplicated, typed)
├── stg_orders
├── stg_order_items
└── stg_products
        ↓
INTERMEDIATE (business logic applied)
└── int_order_items_enriched (joined with product info)
        ↓
MARTS (aggregated for consumption)
└── mart_daily_revenue
```

### SQL Implementation

**Step 1: Staging — Deduplicate Orders**

```sql
CREATE OR REPLACE VIEW stg_orders AS
WITH deduplicated AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY _loaded_at DESC  -- Keep most recent if duplicated
        ) AS rn
    FROM raw_orders
    WHERE order_status != 'cancelled'  -- Exclude cancelled
)
SELECT
    order_id,
    customer_id,
    order_date::date AS order_date,
    order_total::numeric(12,2) AS order_total,
    _loaded_at
FROM deduplicated
WHERE rn = 1;
```

**Step 2: Staging — Clean Order Items**

```sql
CREATE OR REPLACE VIEW stg_order_items AS
SELECT
    order_item_id,
    order_id,
    product_id,
    quantity::int AS quantity,
    unit_price::numeric(10,2) AS unit_price,
    (quantity * unit_price)::numeric(12,2) AS line_total
FROM raw_order_items
WHERE quantity > 0  -- Exclude invalid records
  AND unit_price > 0;
```

**Step 3: Staging — Products**

```sql
CREATE OR REPLACE VIEW stg_products AS
SELECT
    product_id,
    product_name,
    COALESCE(category, 'Uncategorized') AS category,
    COALESCE(subcategory, 'Other') AS subcategory
FROM raw_products;
```

**Step 4: Intermediate — Enrich Order Items**

```sql
CREATE OR REPLACE VIEW int_order_items_enriched AS
SELECT
    oi.order_item_id,
    oi.order_id,
    o.order_date,
    o.customer_id,
    oi.product_id,
    p.product_name,
    p.category,
    p.subcategory,
    oi.quantity,
    oi.unit_price,
    oi.line_total
FROM stg_order_items oi
JOIN stg_orders o ON oi.order_id = o.order_id
JOIN stg_products p ON oi.product_id = p.product_id;
```

**Step 5: Mart — Daily Revenue**

```sql
CREATE TABLE mart_daily_revenue AS
WITH date_spine AS (
    SELECT generate_series(
        (SELECT MIN(order_date) FROM stg_orders),
        CURRENT_DATE,
        '1 day'::interval
    )::date AS dt
),
revenue_by_day_category AS (
    SELECT
        order_date,
        category,
        SUM(line_total) AS revenue,
        COUNT(DISTINCT order_id) AS orders,
        SUM(quantity) AS units_sold
    FROM int_order_items_enriched
    GROUP BY order_date, category
)
SELECT
    ds.dt AS date,
    COALESCE(r.category, 'No Sales') AS category,
    COALESCE(r.revenue, 0) AS revenue,
    COALESCE(r.orders, 0) AS orders,
    COALESCE(r.units_sold, 0) AS units_sold
FROM date_spine ds
LEFT JOIN revenue_by_day_category r ON ds.dt = r.order_date
ORDER BY ds.dt, r.category;
```

### Why This Design?

| Decision | Reason |
|----------|--------|
| Views for staging | Easy to update; no data duplication |
| Table for mart | Performance for dashboard queries |
| Date spine | Ensures no gaps for time series charts |
| Explicit COALESCE | Documents handling of missing data |
| ROW_NUMBER for dedupe | Deterministic (vs DISTINCT, which is arbitrary) |

Challenge 2: Write Data Quality Tests
**Problem:** Write SQL tests that would catch these data issues:
1. Duplicate order IDs
2. NULL customer IDs on orders
3. Orders referencing non-existent products
4. Negative quantities
5. Orders dated in the future

✅ Test Queries
```sql
-- Test 1: No duplicate order IDs in staging
-- Expected: 0 rows
SELECT order_id, COUNT(*) AS duplicates
FROM stg_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Test 2: No NULL customer IDs
-- Expected: 0 rows
SELECT COUNT(*) AS null_customers
FROM stg_orders
WHERE customer_id IS NULL;

-- Test 3: Referential integrity — all products exist
-- Expected: 0 rows
SELECT oi.product_id, COUNT(*) AS orphan_count
FROM stg_order_items oi
LEFT JOIN stg_products p ON oi.product_id = p.product_id
WHERE p.product_id IS NULL
GROUP BY oi.product_id;

-- Test 4: No negative quantities
-- Expected: 0 rows
SELECT COUNT(*) AS negative_qty
FROM stg_order_items
WHERE quantity < 0;

-- Test 5: No future orders
-- Expected: 0 rows
SELECT COUNT(*) AS future_orders
FROM stg_orders
WHERE order_date > CURRENT_DATE;

-- Bonus: Aggregate test — mart totals should match source totals
WITH mart_total AS (
    SELECT SUM(revenue) AS mart_revenue FROM mart_daily_revenue
),
source_total AS (
    SELECT SUM(line_total) AS source_revenue FROM int_order_items_enriched
)
SELECT
    m.mart_revenue,
    s.source_revenue,
    ABS(m.mart_revenue - s.source_revenue) AS difference
FROM mart_total m, source_total s
WHERE ABS(m.mart_revenue - s.source_revenue) > 0.01;  -- Tolerance for rounding
```

### dbt-Style Test YAML

If using dbt, these translate to:

```yaml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: order_date
        tests:
          - not_null
          - dbt_utils.accepted_range:
              max_value: "current_date"
  - name: stg_order_items
    columns:
      - name: product_id
        tests:
          - relationships:
              to: ref('stg_products')
              field: product_id
      - name: quantity
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
```

Challenge 3: Handle Late-Arriving Data
**Problem:** Your pipeline runs daily at 6 AM. But some orders from yesterday arrive at 7 AM (late-arriving data). How do you handle this?

**Approaches to consider:**
1. Re-run the entire pipeline
2. Use incremental processing
3. Use event time vs processing time

✅ Solution Patterns
### Pattern 1: Full Refresh (Simple but Expensive)

```sql
-- Just rebuild the whole mart
TRUNCATE mart_daily_revenue;
INSERT INTO mart_daily_revenue
SELECT ... FROM int_order_items_enriched ...;
```

**Pros:** Simple, always correct
**Cons:** Slow for large tables; wasteful

### Pattern 2: Incremental with Lookback Window

```sql
-- Delete and replace recent data (e.g., last 3 days)
DELETE FROM mart_daily_revenue
WHERE date >= CURRENT_DATE - INTERVAL '3 days';

INSERT INTO mart_daily_revenue
SELECT ... FROM int_order_items_enriched
WHERE order_date >= CURRENT_DATE - INTERVAL '3 days';
```

**Pros:** Handles late-arriving data within the window
**Cons:** Need to choose an appropriate window size

### Pattern 3: Upsert (Postgres ON CONFLICT)

```sql
-- Add unique constraint
ALTER TABLE mart_daily_revenue
ADD CONSTRAINT unique_date_category UNIQUE (date, category);

-- Upsert
INSERT INTO mart_daily_revenue (date, category, revenue, orders, units_sold)
SELECT
    order_date,
    category,
    SUM(line_total),
    COUNT(DISTINCT order_id),
    SUM(quantity)
FROM int_order_items_enriched
WHERE order_date >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY order_date, category
ON CONFLICT (date, category) DO UPDATE SET
    revenue = EXCLUDED.revenue,
    orders = EXCLUDED.orders,
    units_sold = EXCLUDED.units_sold,
    updated_at = NOW();
```

**Pros:** Only touches changed rows; idempotent
**Cons:** Requires a unique constraint; more complex

### Pattern 4: Event Time Partitioning

```sql
-- Partition by order_date (event time), not load_date (processing time)
CREATE TABLE mart_daily_revenue (
    date DATE,
    category TEXT,
    revenue NUMERIC,
    ...
) PARTITION BY RANGE (date);

-- Replace only affected partitions
ALTER TABLE mart_daily_revenue DETACH PARTITION mart_daily_revenue_2025_12;
-- Rebuild that month
CREATE TABLE mart_daily_revenue_2025_12 AS ...;
ALTER TABLE mart_daily_revenue ATTACH PARTITION mart_daily_revenue_2025_12
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
```

**Pros:** Efficient for large tables; atomic partition swaps
**Cons:** Operational complexity

### Decision Matrix

| Data Volume | Lateness Tolerance | Recommended Pattern |
|-------------|-------------------|---------------------|
| Small (<1M rows) | Hours | Full refresh |
| Medium | Days | Incremental + lookback |
| Large | Days | Upsert or partitioning |
| Very large | Weeks | Partitioning + selective rebuild |

Challenge 4: Debug a Broken Pipeline
**Scenario:** Your daily revenue mart is showing $0 for yesterday. The source tables have data. What do you check?

✅ Debugging Checklist
### Immediate Checks (< 5 minutes)

```sql
-- 1. Is there data in source for yesterday?
SELECT COUNT(*) FROM raw_orders
WHERE order_date::date = CURRENT_DATE - 1;

-- 2. Did staging dedupe remove everything?
SELECT COUNT(*) FROM stg_orders
WHERE order_date = CURRENT_DATE - 1;

-- 3. Is there a JOIN failure (orphan records)?
SELECT COUNT(*)
FROM stg_order_items oi
LEFT JOIN stg_orders o ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;

-- 4. Date/timezone issue?
SELECT DISTINCT
    order_date,
    order_date AT TIME ZONE 'UTC' AT TIME ZONE 'America/Los_Angeles' AS pst_date
FROM stg_orders
ORDER BY order_date DESC
LIMIT 10;

-- 5. Was the mart actually updated?
SELECT MAX(updated_at) FROM mart_daily_revenue;
```

### Common Root Causes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Raw has data, staging is empty | WHERE clause too restrictive | Check staging view filters |
| Staging has data, mart is 0 | JOIN condition mismatch | Check key column types/values |
| Only yesterday is 0 | Pipeline didn't run | Check scheduler logs |
| All dates are wrong | Timezone conversion issue | Ensure consistent timezone handling |
| Partial data missing | Late-arriving data | Extend lookback window |

### Prevention: Add Pipeline Monitoring

```sql
-- Freshness check
SELECT
    MAX(order_date) AS latest_order,
    CURRENT_DATE - MAX(order_date) AS days_behind
FROM mart_daily_revenue;
-- Expect: days_behind should be 0 or 1

-- Volume anomaly check
SELECT
    date,
    revenue,
    AVG(revenue) OVER (
        ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS trailing_avg,
    revenue / NULLIF(AVG(revenue) OVER (
        ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ), 0) AS ratio
FROM mart_daily_revenue
WHERE date >= CURRENT_DATE - 14
ORDER BY date DESC;
-- Expect: ratio should be between 0.5 and 2.0 (alert if outside)
```

💬 Week 7 Discussion Prompts
- “Share your staging → mart pipeline” — Get feedback on design decisions.
- “What’s the worst data quality bug you’ve encountered?” — Learn from real-world failures.
- “dbt vs raw SQL: what’s your preference?” — Discuss tooling trade-offs.
⚠️ Common Week 7 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| Not testing for referential integrity | Silent data loss on JOINs | Add LEFT JOIN checks; count orphan records |
| Using SELECT * in views | Schema changes break downstream | Explicitly list columns |
| No date spine | Gaps in time series break charts | Always LEFT JOIN from a complete date range |
| Hardcoded dates | Pipeline breaks tomorrow | Use CURRENT_DATE and relative intervals |
✅ Week 7 Self-Assessment
Before moving to Week 8, can you:
- Design a raw → staging → mart pipeline on a whiteboard?
- Write a ROW_NUMBER dedupe query with explicit tie-breakers?
- Create data quality tests for uniqueness, nulls, and referential integrity?
- Handle late-arriving data with an incremental load pattern?
Milestone: Working staging → mart pipeline with tests.
Week 8: Capstone & Storytelling
Goal: Synthesize everything into a portfolio-quality deliverable. New for 2026: Required 3-minute video walkthrough.
| Day | Focus | Do This | Resource |
|---|---|---|---|
| 50 | Choose Capstone Topic | Pick a question you can answer with a public dataset | Kaggle, Census, Google Trends |
| 51 | Data Prep & EDA | Clean and explore; note 3 hypotheses | Your notebook |
| 52 | Analysis & Insight | Run stats/SQL/Python; validate or reject hypotheses | Your notebook |
| 53 | Build Presentation | Create 5 slides: Problem → Approach → Key Insight → Recommendation → Next Steps | Google Slides / Canva |
| 54 | Record Video Walkthrough | Record a 3-minute Loom explaining your project (required!) | Loom |
| 55–56 | Polish & Publish | Finalize notebook + slides + video; post on LinkedIn/GitHub | Your portfolio |
🆕 Video Walkthrough Requirement (Communication is 50% of the Job)
Why This Matters:
- In real jobs, you’ll present findings to stakeholders who won’t read your notebook
- Video communication is increasingly expected (remote work, async updates)
- This is your chance to practice explaining technical work to non-technical audiences
3-Minute Video Structure:
- 0:00-0:30 — The Hook: What question did you answer? Why does it matter?
- 0:30-1:30 — The Method: How did you approach it? (Keep it high-level, avoid code details)
- 1:30-2:30 — The Insight: What did you find? Show your key visualization.
- 2:30-3:00 — The Recommendation: What should the business do? What’s the next step?
Tools for Recording:
- Loom — Free, easy screen + camera recording
- OBS Studio — Free, more control (local recording)
- Zoom — Record yourself presenting
Pro Tips:
- Write a script first (or at least bullet points)
- Practice once before recording
- It’s okay to do multiple takes!
- Include your face in the corner — builds trust
🤖 AI-Augmented Challenge (Week 8)
Presentation Critique with AI:
- Write your 5-slide presentation outline
- Submit it to ChatGPT: “You are a presentation coach. Critique this data presentation structure. What’s confusing? What’s missing? How can I make the story more compelling?”
- Revise based on feedback
- Reflection: What storytelling improvements did the AI suggest?
🎯 Week 8 Capstone Framework
Your capstone should demonstrate every skill you’ve built:
| Week | Skill to Showcase | How to Demonstrate in Capstone |
|---|---|---|
| 1-2 | SQL mastery | Include a complex query (CTEs, window functions) |
| 3 | Statistics | Include a statistical test with CI (see the sketch after this table) |
| 4 | A/B testing mindset | Frame a “next steps” experiment |
| 5 | Python/Pandas (or Polars!) | Show clean data pipeline code |
| 6 | Product metrics | Track a North Star metric with metric tree |
| 7 | AE fundamentals | Document your data transformation logic |
| 8 | Communication | 3-minute video + clear slides |
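For the Week 3 row above ("statistical test with CI"), here is a minimal sketch of what that could look like in a capstone notebook: Welch's t-test plus a 95% confidence interval for the difference in means. The data is synthetic, and `confidence_interval` on the t-test result requires SciPy 1.10 or newer:

```python
# Minimal sketch: two-sample t-test with a 95% CI (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(100, 15, 500)     # e.g., baseline group metric
treatment = rng.normal(104, 15, 500)   # e.g., variant group metric

res = stats.ttest_ind(treatment, control, equal_var=False)   # Welch's t-test
ci = res.confidence_interval(confidence_level=0.95)          # SciPy 1.10+

print(f"diff = {treatment.mean() - control.mean():.2f}")
print(f"p-value = {res.pvalue:.4f}")
print(f"95% CI for the difference: [{ci.low:.2f}, {ci.high:.2f}]")
```

Reporting the interval alongside the p-value shows reviewers you care about effect size, not just significance.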
🧠 Week 8 Challenge: Example Capstone Project
Full Capstone Example: "What Drives Airbnb Pricing in NYC?"
### 📋 Project Overview

**Question:** What factors most influence Airbnb listing prices in NYC?

**Dataset:** [Inside Airbnb NYC](http://insideairbnb.com/get-the-data/)

**Skills Demonstrated:**
- SQL (data exploration)
- Statistics (hypothesis testing)
- Python (EDA, visualization)
- Product thinking (recommendations)
- Storytelling (presentation)

---

### 🔍 Part 1: Data Exploration (SQL)

```sql
-- Load data into Postgres and explore
-- (This would be in a staging layer in a real pipeline)

-- Key stats by neighborhood
WITH neighborhood_stats AS (
    SELECT
        neighbourhood_group AS borough,
        neighbourhood,
        COUNT(*) AS listings,
        ROUND(AVG(price), 2) AS avg_price,
        ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price), 2) AS median_price,
        ROUND(AVG(availability_365), 1) AS avg_availability
    FROM listings
    WHERE price BETWEEN 20 AND 1000  -- Filter outliers
    GROUP BY neighbourhood_group, neighbourhood
)
SELECT
    borough,
    SUM(listings) AS total_listings,
    ROUND(AVG(avg_price), 2) AS borough_avg_price,
    ROUND(AVG(median_price), 2) AS borough_median_price
FROM neighborhood_stats
GROUP BY borough
ORDER BY borough_avg_price DESC;
```

**Key Finding:** Manhattan has 2x the median price of other boroughs.

---

### 📊 Part 2: Hypothesis Testing (Python)

**Hypothesis:** Superhosts charge higher prices than regular hosts.

```python
import pandas as pd
from scipy import stats

# Load data
listings = pd.read_csv('listings.csv')

# Filter to reasonable price range
listings = listings[(listings['price'] >= 20) & (listings['price'] <= 1000)]

# Split by superhost status
superhosts = listings[listings['host_is_superhost'] == 't']['price']
regular_hosts = listings[listings['host_is_superhost'] == 'f']['price']

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(superhosts, regular_hosts, equal_var=False)

# Effect size (Cohen's d)
pooled_std = ((superhosts.std()**2 + regular_hosts.std()**2) / 2) ** 0.5
cohens_d = (superhosts.mean() - regular_hosts.mean()) / pooled_std

print(f"Superhost mean: ${superhosts.mean():.2f}")
print(f"Regular host mean: ${regular_hosts.mean():.2f}")
print(f"Difference: ${superhosts.mean() - regular_hosts.mean():.2f}")
print(f"p-value: {p_value:.4f}")
print(f"Cohen's d: {cohens_d:.3f}")
```

**Result:**
- Superhosts charge $12 more on average (p < 0.001)
- Effect size is small (d = 0.15), but statistically significant
- **Interpretation:** The premium is real but modest

---

### 📈 Part 3: Feature Importance (Python)

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Prepare features
features = ['accommodates', 'bedrooms', 'bathrooms', 'availability_365',
            'number_of_reviews', 'reviews_per_month', 'minimum_nights']

# Create dummy variables for room type
listings_encoded = pd.get_dummies(listings, columns=['room_type'])
X = listings_encoded[features + [c for c in listings_encoded.columns
                                 if c.startswith('room_type_')]]
y = listings_encoded['price']

# Handle missing values
X = X.fillna(X.median())

# Train model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Feature importance
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 5 Price Drivers:")
print(importance_df.head())
```

**Key Finding:** Top 3 drivers are:
1. Number of bedrooms (35% importance)
2. Accommodates (25% importance)
3. Room type = Entire home (20% importance)

---

### 🎤 Part 4: The Presentation (5 Slides)

#### Slide 1: The Problem
> "Airbnb hosts often misprice their listings, leaving money on the table or sitting vacant. What actually drives price?"

#### Slide 2: The Approach
- Dataset: 50,000+ NYC Airbnb listings
- Analysis: SQL exploration → Statistical testing → Feature modeling
- Validation: 80/20 train-test split

#### Slide 3: Key Insight
> "**Bedrooms and capacity drive 60% of pricing** — but Superhost status only adds a $12 premium despite the badge."

*Include a feature importance bar chart*

#### Slide 4: Recommendation
> "For hosts: Focus on accurately conveying capacity. For Airbnb: Consider making Superhost benefits more visible in search."

#### Slide 5: Next Steps
> "Proposed experiment: A/B test showing the 'Superhost' badge more prominently in search results. Hypothesis: Will increase Superhost booking rate by 5%."

---

### 📁 Project Structure

```
airbnb-price-analysis/
├── README.md                     # Overview, key findings, how to run
├── data/
│   └── listings.csv              # Raw data (or download script)
├── notebooks/
│   ├── 01_exploration.ipynb      # SQL-like exploration in Python
│   ├── 02_hypothesis_tests.ipynb
│   └── 03_feature_analysis.ipynb
├── sql/
│   └── exploration_queries.sql
├── presentation/
│   └── airbnb_pricing.pdf        # Final slides
└── requirements.txt
```

📝 Capstone Presentation Framework
The "SCR" Framework (Situation-Complication-Resolution)
Every great analytics presentation follows this structure:

**Situation (Slide 1):** Set the stage
- What's the business context?
- Why does this matter now?
- What question are we answering?

**Complication (Slides 2-3):** The twist
- What did we find that was surprising?
- What's the tension or trade-off?
- What data supports this?

**Resolution (Slides 4-5):** The recommendation
- What should we do?
- What's the expected impact?
- What are the next steps or experiments?

### Example: "User Churn Analysis"

| Section | Content |
|---------|---------|
| **Situation** | "Our 30-day churn rate has increased from 12% to 18% over the past quarter." |
| **Complication** | "We found that 80% of churned users never completed onboarding. But surprisingly, users who skipped the tutorial had HIGHER retention than those who completed it halfway." |
| **Resolution** | "Recommendation: Shorten the tutorial from 7 steps to 3. A/B test this with 20% of new users. Expected impact: Reduce churn by 3 percentage points." |

Storytelling Anti-Patterns (What NOT to Do)
| ❌ Anti-Pattern | Why It's Bad | ✅ Do This Instead |
|-----------------|--------------|-------------------|
| "Here's a dashboard" | No narrative; audience draws wrong conclusions | "Here's what I found and what we should do" |
| "The data shows..." (passive voice) | Weak; hides ownership | "I found..." or "We discovered..." |
| Starting with methodology | Loses audience; no hook | Start with the insight or question |
| 10 charts, 3 insights | Overwhelms; dilutes the message | 3 charts maximum; one insight per slide |
| "It's complicated" | Sounds like an excuse | "The key factor is X, though Y also matters" |
| No recommendation | Analysis without action is useless | Always end with "We should..." |

### The "So What?" Test

For every slide, ask: "So what?"

- ❌ "Users in California have higher LTV than Texas" — So what?
- ✅ "Users in California have 40% higher LTV, which means we should increase CAC budget for CA by 20%" — Clear action!

💬 Week 8 Discussion Prompts
- “Share your capstone topic for feedback” — Get input before you’re too deep.
- “Post your presentation and ask for one improvement suggestion” — Practice receiving feedback.
- “What’s the hardest thing about storytelling with data?” — Share struggles and tips.
⚠️ Common Week 8 Mistakes
| Mistake | Why It’s Wrong | Fix |
|---|---|---|
| Choosing too broad a topic | Can’t complete in a week | Narrow to one specific question |
| Spending 80% of time on data cleaning | Shows in the presentation as rushed analysis | Timebox cleaning; use a mostly-clean dataset |
| No recommendation | Analysis without action is wasted work | Every presentation needs “We should…” |
| Burying the insight | Audience loses interest before the payoff | Lead with the headline, then support it |
| Perfect is the enemy of done | Never shipping = no portfolio piece | Ship something imperfect; iterate later |
✅ Week 8 Self-Assessment
Before claiming your Champion badge, confirm:
- Does your capstone showcase SQL, stats, AND Python?
- Does your presentation follow Situation → Complication → Resolution?
- Does every slide answer “So what?”
- Have you received feedback from at least one other person?
- Is your work published (GitHub, LinkedIn, or personal site)?
🎯 Capstone Peer Review Checklist
When reviewing a peer’s capstone:
| Category | Question | Score (1-5) |
|---|---|---|
| Clarity | Is the question clearly stated in the first 30 seconds? | |
| Data | Is the data source credible and appropriate? | |
| Analysis | Does the analysis use appropriate methods for the question? | |
| Insight | Is the key finding surprising, important, or actionable? | |
| Recommendation | Is there a clear “We should…” statement? | |
| Storytelling | Does the narrative flow logically? | |
| Visuals | Are charts clear and labeled appropriately? | |
| Polish | Is the code clean and the notebook well-organized? | |
Feedback format: “I liked [specific thing]. One suggestion: [specific improvement].”
Milestone: Capstone notebook + slide deck published. 🎉
Tracking Your Progress
GitHub Progress Tracker
Create a public GitHub repo called 2026-analytics-challenge to track your journey:
```bash
# Create your challenge repo
mkdir 2026-analytics-challenge
cd 2026-analytics-challenge
git init
echo "# 🎯 My 2026 Analytics Challenge Journey" > README.md
echo "" >> README.md
echo "" >> README.md
```
Progress Checklist
Copy this to your repo’s README or a Notion page:
```markdown
## 🗓️ Challenge Progress
### Week 1: SQL Fundamentals
- [ ] Day 1: SELECT, WHERE, ORDER (3 problems)
- [ ] Day 2: JOINs (3 problems)
- [ ] Day 3: GROUP BY, HAVING (3 problems)
- [ ] Day 4: Subqueries (2 problems)
- [ ] Day 5: CTEs (3 problems)
- [ ] Day 6-7: Mini-project query
- [ ] **Posted Week 1 check-in to Discussions**
### Week 2: Window Functions & CTEs
- [ ] Day 8: ROW_NUMBER, RANK (2 problems)
- [ ] Day 9: LAG/LEAD (2 problems)
- [ ] Day 10: Running totals (2 problems)
- [ ] Day 11: NTILE, PERCENT_RANK
- [ ] Day 12: Cohort retention query
- [ ] Day 13-14: Visualization
- [ ] **Posted Week 2 check-in to Discussions**
### Week 3: Statistics
- [ ] Day 15: Distributions
- [ ] Day 16: CLT simulation
- [ ] Day 17: Hypothesis testing
- [ ] Day 18: p-values & CI explainer
- [ ] Day 19: Power & sample size
- [ ] Day 20-21: Sample size script
- [ ] **Posted Week 3 check-in to Discussions**
### Week 4: A/B Testing
- [ ] Day 22: Experiment design
- [ ] Day 23: Randomization
- [ ] Day 24: Analyze mock result
- [ ] Day 25: Edge cases
- [ ] Day 26: Common pitfalls
- [ ] Day 27-28: Full experiment doc
- [ ] **Posted Week 4 check-in to Discussions**
- [ ] 🎯 **HALFWAY BADGE EARNED!**
### Week 5: Python/Pandas
- [ ] Day 29: DataFrame basics
- [ ] Day 30: Data cleaning
- [ ] Day 31: Groupby & aggregation
- [ ] Day 32: Merging & joining
- [ ] Day 33: Visualization
- [ ] Day 34-35: EDA notebook
- [ ] **Posted Week 5 check-in to Discussions**
### Week 6: Product Metrics
- [ ] Day 36: Metrics frameworks
- [ ] Day 37: North Star metrics
- [ ] Day 38: Case study #1
- [ ] Day 39: Case study #2
- [ ] Day 40: Case study #3
- [ ] Day 41-42: Written case doc
- [ ] **Posted Week 6 check-in to Discussions**
### Week 7: Analytical Engineering
- [ ] Day 43: Data modeling layers
- [ ] Day 44: De-duplication
- [ ] Day 45: Idempotent loads
- [ ] Day 46: Data quality checks
- [ ] Day 47: Date spines
- [ ] Day 48-49: Pipeline project
- [ ] **Posted Week 7 check-in to Discussions**
### Week 8: Capstone
- [ ] Day 50: Choose topic
- [ ] Day 51: Data prep & EDA
- [ ] Day 52: Analysis
- [ ] Day 53: Build presentation
- [ ] Day 54: Present/record
- [ ] Day 55-56: Polish & publish
- [ ] **Posted Capstone to Discussions**
- [ ] 🏆 **CHAMPION BADGE EARNED!**
```
Social Accountability
Share weekly wins on social media with #2026AnalyticsChallenge for accountability.
LinkedIn Post Template:
```
🎯 Week [X] of #2026AnalyticsChallenge complete!

This week I learned:
• [Key insight 1]
• [Key insight 2]

Check out my work: [GitHub link]

Thanks to @Data-Science-Analytical-Handbook for the structure!

#DataScience #SQL #Analytics #LearningInPublic
```
What You’ll Have After 8 Weeks (March 1, 2026 for the Winter cohort)
| Artifact | Purpose |
|---|---|
| 25+ SQL problems solved | Interview-ready speed |
| Stats & sample-size scripts | Demonstrate rigor |
| EDA + cohort notebooks | Portfolio projects |
| Experiment write-up | Show experimentation fluency |
| Staging → mart pipeline | Analytical engineering proof |
| Capstone presentation | Storytelling & executive presence |
| GitHub Challenge Badge | Visible proof of commitment |
| 3-Minute Video | Demonstrates communication skills |
| Digital Certificate | LinkedIn-ready proof of completion |
📣 Community & Marketing
Cohort Start Dates (Join a Community)
Instead of going it alone, join an official cohort for accountability and community:
| Cohort | Start Date | Registration Deadline |
|---|---|---|
| Winter 2026 | January 1, 2026 | December 25, 2025 |
| Spring 2026 | April 1, 2026 | March 25, 2026 |
| Fall 2026 | September 1, 2026 | August 25, 2026 |
Why Join a Cohort?
- Weekly live AMAs (Twitter Spaces / LinkedIn Audio)
- Dedicated Slack/Discord channel for your cohort
- Peer accountability partners
- Weekly “Spotlight” features for best submissions
The “Commitment Contract” (Social Accountability)
Want to 10x your chances of completing the challenge? Make a public commitment:
- Download this image and post it to LinkedIn/Twitter:
  > 🎯 I’m committed to the 2026 Analytics Challenge! For the next 8 weeks, I’ll be leveling up my SQL, Python, statistics, and storytelling skills. Follow my journey: #2026AnalyticsChallenge github.com/moshesham/Data-Science-Analytical-Handbook

- Star the repo ⭐ — helps others discover the challenge and unlocks bonus materials!
- Post weekly updates with #2026AnalyticsChallenge
Weekly Live AMAs
Every Friday during the challenge, join a 30-minute live session:
- Platform: Twitter Spaces or LinkedIn Audio (links posted in Discussions)
- Format: 10-min recap of the week + 20-min Q&A
- Topics: That week’s content, common struggles, career advice
Spotlight Rewards
Each week, we feature the best submissions:
- Best SQL Query (Week 1-2)
- Best Statistical Explanation (Week 3)
- Best Experiment Design (Week 4)
- Best EDA Notebook (Week 5)
- Best Case Study Answer (Week 6)
- Best Pipeline (Week 7)
- Best Capstone (Week 8)
Winners get featured in the main README and earn a special “Spotlight” badge!
🎖️ Hall of Fame
Completed the challenge? Get recognized!
- Post your capstone to Discussions
- Submit a PR adding your name to the Hall of Fame
- Earn your Champion badge for your profile
| Name | GitHub | Capstone Project | Completion Date |
|---|---|---|---|
| Your name here! | @username | Project title | Feb 2026 |
Additional Resources
- 21-Day Preparation Guide
- SQL Example Problems
- Statistics & Probability Example Questions
- Behavioral Mock Interview
🎯 Start Your Journey Today
Don't wait for January 1st—start preparing now!
Good luck—see you on the other side of the challenge! 🚀