A/B Testing & Experimentation
Design, analyze, and interpret experiments like a senior data scientist.
Why This Matters
A/B testing is the backbone of data-driven product development. At companies like Meta, Google, Netflix, and Amazon, every major product change goes through experimentation. Expect 30%+ of your DS interview to focus on experiment design, analysis, and interpretation.
What interviewers are looking for:
- Can you design a valid experiment from scratch?
- Do you understand when experiments can go wrong?
- Can you interpret results correctly, including edge cases?
- Do you know when NOT to run an A/B test?
The A/B Testing Mental Model
┌──────────────────────────────────────────────────────┐
│ 1. DEFINE THE QUESTION                                │
│    → What are we trying to learn?                     │
│    → What decision will we make based on results?     │
├──────────────────────────────────────────────────────┤
│ 2. DESIGN THE EXPERIMENT                              │
│    → Hypothesis (H₀ and H₁)                           │
│    → Metrics (primary, secondary, guardrail)          │
│    → Sample size & duration                           │
│    → Randomization unit                               │
├──────────────────────────────────────────────────────┤
│ 3. RUN THE EXPERIMENT                                 │
│    → Validate randomization (A/A check)               │
│    → Monitor for bugs & anomalies                     │
│    → DON'T peek and make decisions early!             │
├──────────────────────────────────────────────────────┤
│ 4. ANALYZE RESULTS                                    │
│    → Statistical significance                         │
│    → Practical significance (effect size)             │
│    → Segment analysis (but beware multiple testing)   │
├──────────────────────────────────────────────────────┤
│ 5. MAKE A DECISION                                    │
│    → Ship, iterate, or kill                           │
│    → Document learnings                               │
└──────────────────────────────────────────────────────┘
Key Concepts You Must Know
1. Hypothesis Testing Fundamentals
- Null Hypothesis (H₀): There is no difference between control and treatment
- Alternative Hypothesis (H₁): There is a difference
- p-value: Probability of seeing data this extreme IF H₀ is true
- Significance level (α): Threshold for rejecting H₀ (typically 0.05)
- Power (1-β): Probability of detecting a real effect (typically 0.80)
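To make these definitions concrete, here is a minimal sketch of a two-proportion z-test on made-up counts (the numbers are purely illustrative, not real experiment data):
from statsmodels.stats.proportion import proportions_ztest
# 5.5% vs 5.0% conversion on 10,000 users per group (illustrative numbers)
conversions = [550, 500]
users = [10_000, 10_000]
z_stat, p_value = proportions_ztest(conversions, users)
# Reject H₀ only if p_value < α (0.05); otherwise we "fail to reject", not "accept"
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")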
2. Error Types
| Decision | H₀ True (No Effect) | H₀ False (Real Effect) |
|---|---|---|
| Reject H₀ | Type I Error (α) – False Positive | ✅ Correct – True Positive |
| Fail to Reject H₀ | ✅ Correct – True Negative | Type II Error (β) – False Negative |
3. Sample Size Formula (Proportions)
For a two-proportion z-test with equal group sizes:
n (per group) = 2 × [(Z_α/2 + Z_β)² × p̄(1-p̄)] / (p₁ - p₂)²
where:
p̄ = (p₁ + p₂) / 2 (pooled proportion)
p₁ = baseline conversion rate
p₂ = expected treatment conversion rate
Z_α/2 = 1.96 for α=0.05 (two-tailed)
Z_β = 0.84 for power=0.80
Quick Python implementation:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
# Example: baseline 5%, want to detect 5.5% (10% relative lift)
effect_size = proportion_effectsize(0.05, 0.055)
analysis = NormalIndPower()
n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided')
print(f"Sample size per group: {n:.0f}") # ~31,000
Metrics Framework
Every experiment needs three types of metrics:
| Metric Type | Purpose | Example |
|---|---|---|
| Primary (North Star) | The ONE metric that determines success | Conversion rate, DAU, Revenue/user |
| Secondary | Additional insights, explain mechanisms | Click-through rate, time on page |
| Guardrail | Ensure we don't break something important | Page load time, error rate, unsubscribes |
Example: Testing a new checkout flow
- Primary: Purchase completion rate
- Secondary: Average order value, cart abandonment rate
- Guardrail: Page load time, support tickets, refund rate
Common Pitfalls & How to Avoid Them
1. Peeking Problem
What: Checking results multiple times and stopping when significant.
Why it's bad: Inflates false positive rate. With daily peeking for 20 days at α=0.05, your actual false positive rate can exceed 25%!
Fix: Pre-commit to a sample size and analysis date. If you must peek, use sequential testing (e.g., alpha spending functions).
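To see why peeking is so costly, here is a minimal simulation sketch: an A/A test (no true effect) "analyzed" after each of 20 daily batches, counting how often any peek looks significant. All sizes here are illustrative:
import numpy as np
rng = np.random.default_rng(0)
runs, days, users_per_day, p_base = 2_000, 20, 1_000, 0.05
false_positives = 0
for _ in range(runs):
    # Cumulative conversion counts per arm after each daily batch (no real effect)
    c = rng.binomial(users_per_day, p_base, size=days).cumsum()
    t = rng.binomial(users_per_day, p_base, size=days).cumsum()
    n = users_per_day * np.arange(1, days + 1)
    p_pool = (c + t) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (t - c) / n / se
    if np.any(np.abs(z) > 1.96):  # "significant" at any of the 20 peeks?
        false_positives += 1
print(f"False positive rate with daily peeking: {false_positives / runs:.0%}")
Despite α=0.05 at each individual look, the any-peek false positive rate comes out far above 5%.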
2. Multiple Testing Problem
What: Testing 20 metrics and declaring victory on the one that's significant.
Why it's bad: At α=0.05, testing 20 metrics gives ~64% chance of at least one false positive.
Fix: Pre-specify ONE primary metric. Apply Bonferroni (α/n) or Benjamini-Hochberg (FDR) corrections for secondaries.
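A sketch of applying both corrections with statsmodels (the p-values below are made up for illustration):
from statsmodels.stats.multitest import multipletests
p_values = [0.001, 0.020, 0.040, 0.045, 0.300, 0.800]  # secondary-metric p-values (illustrative)
# Bonferroni: compare each p-value against α/n (conservative, controls family-wise error)
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
# Benjamini-Hochberg: controls the false discovery rate (less conservative)
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(reject_bonf, reject_bh, sep="\n")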
3. Network Effects / Spillover
What: Treatment users affect control users through social connections.
Why it's bad: Dilutes treatment effect; biases toward null.
Fix: Cluster randomization (randomize by geography, community, or time).
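One way to implement this is to bucket by a cluster key instead of the user ID. A minimal sketch, assuming each user can be mapped to a cluster such as a geo or community (the function name and experiment key are illustrative):
import hashlib
def assign_variant(cluster_id: str, experiment: str = "like_button_v1") -> str:
    # Hash the cluster (not the user) so all connected users share one arm
    digest = hashlib.md5(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"
print(assign_variant("geo_US_CA"))  # every user in this cluster gets the same arm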
4. Novelty / Primacy Effects
What: New features show temporary lift from curiosity, or underperform while users adapt.
Fix: Run experiments for 2+ weeks; segment by new vs returning users; look at effect over time.
5. Simpson's Paradox
What: Treatment wins overall but loses in every segment (or vice versa).
Why it happens: The segment mix differs between variants (e.g., one variant skews toward a high-converting segment), so the aggregate comparison is confounded.
Fix: Always check segment-level results; investigate sample ratio mismatch (SRM).
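A tiny illustration with made-up numbers: treatment converts worse in both segments, yet looks better overall because its traffic skews toward the high-converting segment:
import pandas as pd
df = pd.DataFrame({
    "variant":     ["control", "control", "treatment", "treatment"],
    "segment":     ["returning", "new", "returning", "new"],
    "users":       [400, 1000, 1000, 400],
    "conversions": [42, 25, 100, 8],
})
by_segment = df.assign(rate=df.conversions / df.users)
overall = df.groupby("variant")[["conversions", "users"]].sum()
overall["rate"] = overall.conversions / overall.users
print(by_segment[["variant", "segment", "rate"]])  # treatment lower in BOTH segments
print(overall["rate"])                             # ...yet higher overall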
🧠 Challenge Questions with Solutions
Challenge 1: Design an A/B Test
Scenario: Instagram is considering changing the "Like" button from a heart to a thumbs-up. Design the experiment.
Solution Framework:
- Hypothesis:
- H₀: Thumbs-up will not change the like rate
- H₁: Thumbs-up will change the like rate (two-tailed, since we're unsure of the direction)
- Metrics:
- Primary: Like rate (likes / impressions)
- Secondary: Engagement rate, time spent, content creation rate
- Guardrail: DAU, session duration, negative feedback rate
- Sample Size:
- Assume baseline like rate = 5%
- MDE = 2% relative (detect 5% → 5.1%)
- α = 0.05, power = 0.80
- Result: ~750K users per variant (~1.5M total); see the sketch after this list
- Randomization:
- Unit: User ID (not session, not device)
- Consider cluster randomization by region if network effects expected
- Duration:
- Minimum 2 weeks to capture weekly cycles
- Consider novelty effects—new icon might get more clicks initially
- Risks:
- Strong brand association with heart—user backlash
- Novelty effect—temporary lift from curiosity
- Should run sentiment analysis alongside quantitative metrics
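A quick check of the sample-size estimate in the Sample Size step, reusing the statsmodels helpers from earlier (baseline 5%, 2% relative MDE, so 5.0% → 5.1%):
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
effect_size = proportion_effectsize(0.051, 0.05)
n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.8, alternative='two-sided')
print(f"Users per variant: {n:,.0f}")  # roughly 750K per variant, ~1.5M total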
Challenge 2: Interpret Conflicting Results
Scenario: Your A/B test shows:
- Primary metric (conversion): +3% (p=0.03) ✅
- Guardrail metric (page load time): +200ms (p=0.001) ❌
What do you recommend?
Solution:
- Acknowledge the trade-off: We have a conversion win but a performance regression.
- Quantify the trade-off:
- What's the revenue impact of +3% conversion?
- What's the long-term cost of +200ms load time? (Research suggests roughly 1% extra bounce per 100ms; see the back-of-envelope sketch below)
- Investigate root cause:
- Is the load time increase inherent to the feature, or a fixable implementation issue?
- Segment by device/connection—is it only affecting slow connections?
- Recommendation:
- If load time is fixable: Hold launch, fix performance, re-test
- If inherent trade-off: Model long-term impact; usually don't ship perf regressions
Key insight: Never ignore guardrail metrics. Short-term wins often become long-term losses.
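To make that trade-off concrete, here is the back-of-envelope sketch referenced above, with entirely hypothetical traffic and revenue numbers:
# All inputs below are assumptions for illustration only
visitors_per_month = 1_000_000
baseline_conversion = 0.05
revenue_per_conversion = 40.0  # hypothetical average order value
lift_gain = visitors_per_month * baseline_conversion * 0.03 * revenue_per_conversion
extra_bounce = 0.01 * (200 / 100)  # "~1% bounce per 100ms" rule of thumb, so ~2% at +200ms
bounce_loss = visitors_per_month * extra_bounce * baseline_conversion * revenue_per_conversion
print(f"Monthly gain from +3% conversion: ${lift_gain:,.0f}")
print(f"Monthly loss from extra bounces:  ${bounce_loss:,.0f}")
If the bounce loss is a sizable fraction of the conversion gain, the headline "win" may not survive long term.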
Challenge 3: Low Power, What Now?
Scenario: Your sample size calculation shows you need 2M users per variant, but you only get 500K users/week. The experiment would take 2 months. What are your options?
Solution Options:
| Option | Trade-off | When to Use |
|---|---|---|
| Increase MDE | Can only detect larger effects | If smaller effects aren't worth shipping anyway |
| Reduce power to 0.70 | 30% chance of missing real effect | Low-stakes decisions |
| Use one-tailed test | Can't detect negative effects | Only if you'd never ship a negative result |
| Variance reduction (CUPED) | Requires pre-experiment data | Best option if feasible |
| Target high-impact segment | Results may not generalize | Power users, specific geo |
| Just wait (run longer) | Delays product roadmap | High-stakes decisions |
CUPED (Controlled-experiment Using Pre-Experiment Data):
import numpy as np

# CUPED reduces variance by controlling for pre-experiment behavior.
# Adjusted metric: Y_adj = Y - θ × (X - X̄),
# where X = pre-experiment value of the metric and θ = Cov(Y, X) / Var(X)
def cuped_adjust(y, x):
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))
# Running the analysis on the adjusted metric can cut required sample size by 50%+ in some cases
Challenge 4: Sample Ratio Mismatch (SRM)
Scenario: Your 50/50 experiment shows 1,020,000 users in control and 980,000 in treatment. Is this a problem?
Solution:
from scipy.stats import chisquare

observed = [1_020_000, 980_000]
expected = [1_000_000, 1_000_000]
# Chi-squared goodness-of-fit test for sample ratio mismatch
chi2_stat, p_value = chisquare(observed, f_exp=expected)
# chi2_stat = 800; for df=1, anything above 3.84 is significant at α=0.05
# This is HIGHLY significant (p ≈ 0) - something is wrong!
Common causes of SRM:
- Bot filtering applied differently
- Redirect/loading issues in one variant
- Bucketing bug in the experiment system
- Treatment causing more users to log out (lose tracking)
Action: DO NOT interpret results until SRM is resolved. Investigate root cause first.
Challenge 5: When NOT to A/B Test
Question: Give at least five scenarios where A/B testing is NOT appropriate.
Solution:
- Obvious improvements: Fixing a bug, improving load time. Just ship it.
- Legal/compliance changes: GDPR requirements. No choice but to comply.
- Low traffic: Would take years to reach significance. Use qualitative research.
- Network effects dominate: Marketplace features where treatment affects control through shared inventory.
- Long-term effects matter most: Education/habit-forming features. Effect emerges over months, not weeks.
- Ethical concerns: Testing features that could harm users (e.g., addiction-promoting).
- Launch-and-iterate is cheaper: Low-risk UI changes with easy rollback.
Alternative methods:
- User research & qualitative testing
- Quasi-experimental designs (diff-in-diff, regression discontinuity)
- Holdout/long-term experiments
- Synthetic control methods
Interview Cheat Sheet
| Question Type | What They're Testing | Key Points to Hit |
|---|---|---|
| "Design an A/B test for..." | Structured thinking | Hypothesis → Metrics → Sample size → Randomization → Duration → Risks |
| "Results are significant but..." | Critical thinking | Statistical vs practical significance, trade-offs, segment analysis |
| "What could go wrong?" | Experience | SRM, novelty, spillover, multiple testing, Simpson's paradox |
| "Results are flat, what now?" | Pragmatism | Check power, segment, don't ship (absence of evidence ≠ evidence of absence) |
💬 Discussion Prompts
- "What's your process for choosing MDE?" — Share how you balance business needs with statistical feasibility.
- "Describe a time an experiment surprised you" — Post your war stories for others to learn from.
- "How do you handle stakeholders who want to peek?" — Share strategies for educating non-technical partners.
✅ Self-Assessment
Before moving on, confirm you can:
- ☐ Calculate sample size using the formula AND explain the intuition
- ☐ Design an experiment with hypothesis, metrics, and randomization plan
- ☐ Explain Type I and Type II errors to a non-technical PM
- ☐ Identify at least 5 common A/B testing pitfalls
- ☐ Recommend when NOT to run an A/B test