A/B Testing & Experimentation

Design, analyze, and interpret experiments like a senior data scientist.

Why This Matters

A/B testing is the backbone of data-driven product development. At companies like Meta, Google, Netflix, and Amazon, every major product change goes through experimentation. Expect 30%+ of your DS interview to focus on experiment design, analysis, and interpretation.

What interviewers are looking for:

  • Can you design a valid experiment from scratch?
  • Do you understand when experiments can go wrong?
  • Can you interpret results correctly, including edge cases?
  • Do you know when NOT to run an A/B test?

The A/B Testing Mental Model

┌─────────────────────────────────────────────────────────────────┐
│  1. DEFINE THE QUESTION                                         │
│     → What are we trying to learn?                              │
│     → What decision will we make based on results?              │
├─────────────────────────────────────────────────────────────────┤
│  2. DESIGN THE EXPERIMENT                                       │
│     → Hypothesis (H₀ and H₁)                                    │
│     → Metrics (primary, secondary, guardrail)                   │
│     → Sample size & duration                                    │
│     → Randomization unit                                        │
├─────────────────────────────────────────────────────────────────┤
│  3. RUN THE EXPERIMENT                                          │
│     → Validate randomization (A/A check)                        │
│     → Monitor for bugs & anomalies                              │
│     → DON'T peek and make decisions early!                      │
├─────────────────────────────────────────────────────────────────┤
│  4. ANALYZE RESULTS                                             │
│     → Statistical significance                                  │
│     → Practical significance (effect size)                      │
│     → Segment analysis (but beware multiple testing)            │
├─────────────────────────────────────────────────────────────────┤
│  5. MAKE A DECISION                                             │
│     → Ship, iterate, or kill                                    │
│     → Document learnings                                        │
└─────────────────────────────────────────────────────────────────┘
    

Key Concepts You Must Know

1. Hypothesis Testing Fundamentals

  • Null Hypothesis (H₀): There is no difference between control and treatment
  • Alternative Hypothesis (H₁): There is a difference
  • p-value: Probability of seeing data this extreme IF H₀ is true
  • Significance level (α): Threshold for rejecting H₀ (typically 0.05)
  • Power (1-β): Probability of detecting a real effect (typically 0.80)

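To see these definitions in action, here is a minimal sketch of a two-sided two-proportion z-test using statsmodels; the conversion counts are made-up numbers for illustration:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: 520/10,000 conversions in control, 585/10,000 in treatment
conversions = np.array([520, 585])
samples = np.array([10_000, 10_000])

# H0: the two conversion rates are equal; H1: they differ (two-tailed)
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples, alternative='two-sided')
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # p ≈ 0.04 < α = 0.05, so reject H0
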
2. Error Types

|                   | H₀ True (No Effect)                | H₀ False (Real Effect)              |
|-------------------|------------------------------------|-------------------------------------|
| Reject H₀         | Type I Error (α) – False Positive  | ✅ Correct – True Positive          |
| Fail to Reject H₀ | ✅ Correct – True Negative         | Type II Error (β) – False Negative  |

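A quick simulation sketch ties this table to α and power: in an A/A comparison (H₀ true) the test should reject about 5% of the time, and with the ~31,000-per-group sample size computed in the next subsection it should detect a real 5.0% → 5.5% lift about 80% of the time. All data here are simulated:

import numpy as np

rng = np.random.default_rng(1)
n, n_sims, z_crit = 31_000, 1_000, 1.96  # n per group, from the sample-size example below

def reject_rate(p_control, p_treatment):
    """Share of simulated experiments where a pooled two-proportion z-test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        c = rng.binomial(n, p_control) / n
        t = rng.binomial(n, p_treatment) / n
        p_pool = (c + t) / 2
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        rejections += abs(t - c) / se > z_crit
    return rejections / n_sims

print("Type I error rate (A/A, no real effect):", reject_rate(0.05, 0.05))   # ≈ 0.05 = α
print("Power (real lift from 5.0% to 5.5%):", reject_rate(0.05, 0.055))      # ≈ 0.80 = 1 - β
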
3. Sample Size Formula (Proportions)

For a two-proportion z-test with equal group sizes:

n = 2 × [(Z_α/2 + Z_β)² × p̄(1-p̄)] / (p₁ - p₂)²

where:
  p̄ = (p₁ + p₂) / 2  (pooled proportion)
  p₁ = baseline conversion rate
  p₂ = expected treatment conversion rate
  Z_α/2 = 1.96 for α=0.05 (two-tailed)
  Z_β = 0.84 for power=0.80

Quick Python implementation:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Example: baseline 5%, want to detect 5.5% (10% relative lift)
effect_size = proportion_effectsize(0.05, 0.055)
analysis = NormalIndPower()
n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided')
print(f"Sample size per group: {n:.0f}")  # ~31,000

Metrics Framework

Every experiment needs three types of metrics:

| Metric Type          | Purpose                                    | Example                                  |
|----------------------|--------------------------------------------|------------------------------------------|
| Primary (North Star) | The ONE metric that determines success     | Conversion rate, DAU, Revenue/user       |
| Secondary            | Additional insights, explain mechanisms    | Click-through rate, time on page         |
| Guardrail            | Ensure we don't break something important  | Page load time, error rate, unsubscribes |

Example: Testing a new checkout flow

  • Primary: Purchase completion rate
  • Secondary: Average order value, cart abandonment rate
  • Guardrail: Page load time, support tickets, refund rate

Common Pitfalls & How to Avoid Them

1. Peeking Problem

What: Checking results multiple times and stopping when significant.

Why it's bad: Inflates false positive rate. With daily peeking for 20 days at α=0.05, your actual false positive rate can exceed 25%!

Fix: Pre-commit to a sample size and analysis date. If you must peek, use sequential testing (e.g., alpha spending functions).

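A small Monte Carlo sketch shows why this matters: both groups below are drawn from the same 5% conversion rate (a pure A/A test), yet stopping at the first "significant" daily peek declares a winner far more often than the nominal 5%. All numbers are illustrative:

import numpy as np

rng = np.random.default_rng(42)
n_sims, n_days, users_per_day, z_crit = 1_000, 20, 500, 1.96

false_positives = 0
for _ in range(n_sims):
    # A/A test: no real effect exists in either group
    a = rng.binomial(1, 0.05, size=n_days * users_per_day)
    b = rng.binomial(1, 0.05, size=n_days * users_per_day)
    for day in range(1, n_days + 1):
        n = day * users_per_day
        p_a, p_b = a[:n].mean(), b[:n].mean()
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(p_a - p_b) / se > z_crit:
            false_positives += 1  # stopped early and declared a (false) winner
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")  # typically ~20-30%
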
2. Multiple Testing Problem

What: Testing 20 metrics and declaring victory on the one that's significant.

Why it's bad: At α=0.05, testing 20 metrics gives ~64% chance of at least one false positive.

Fix: Pre-specify ONE primary metric. Apply Bonferroni (α/n) or Benjamini-Hochberg (FDR) corrections for secondaries.

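A minimal sketch of applying these corrections with statsmodels, using simulated p-values for 20 secondary metrics that have no real effect (so any "win" is a false positive):

import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Under the null, p-values are uniform on [0, 1]
p_values = rng.uniform(0, 1, size=20)

print("Uncorrected 'wins' at α=0.05:", (p_values < 0.05).sum())

# Bonferroni controls the family-wise error rate; Benjamini-Hochberg controls the FDR
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print("Bonferroni wins:", reject_bonf.sum(), "| Benjamini-Hochberg wins:", reject_bh.sum())
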
3. Network Effects / Spillover

What: Treatment users affect control users through social connections.

Why it's bad: Dilutes treatment effect; biases toward null.

Fix: Cluster randomization (randomize by geography, community, or time).

4. Novelty / Primacy Effects

What: New features show temporary lift from curiosity, or underperform while users adapt.

Fix: Run experiments for 2+ weeks; segment by new vs returning users; look at effect over time.

5. Simpson's Paradox

What: Treatment wins overall but loses in every segment (or vice versa).

Why it happens: Unequal segment sizes between variants.

Fix: Always check segment-level results; investigate sample ratio mismatch (SRM).

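A tiny numeric illustration with hypothetical counts: the treatment wins in both segments but loses overall, because its traffic skews toward the low-converting segment:

# (conversions, users) per segment and variant -- numbers chosen purely to show the effect
data = {
    "new users":       {"control": (10, 100),  "treatment": (60, 500)},
    "returning users": {"control": (250, 500), "treatment": (55, 100)},
}

for seg, groups in data.items():
    rates = {g: conv / n for g, (conv, n) in groups.items()}
    print(seg, {g: f"{r:.1%}" for g, r in rates.items()})  # treatment wins in BOTH segments

for g in ("control", "treatment"):
    conv = sum(data[seg][g][0] for seg in data)
    n = sum(data[seg][g][1] for seg in data)
    print(g, f"overall: {conv / n:.1%}")  # ...yet treatment loses overall
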
🧠 Challenge Questions with Solutions

Challenge 1: Design an A/B Test

Scenario: Instagram is considering changing the "Like" button from a heart to a thumbs-up. Design the experiment.

Solution Framework:

  1. Hypothesis:
    • H₀: Thumbs-up will not change like rate
    • H₁: Thumbs-up will change like rate (two-tailed since we're unsure of direction)
  2. Metrics:
    • Primary: Like rate (likes / impressions)
    • Secondary: Engagement rate, time spent, content creation rate
    • Guardrail: DAU, session duration, negative feedback rate
  3. Sample Size:
    • Assume baseline like rate = 5%
    • MDE = 2% relative (detect 5% → 5.1%)
    • α = 0.05, power = 0.80
    • Result: ~750K users per variant, ~1.5M total (see the calculation sketched after this list)
  4. Randomization:
    • Unit: User ID (not session, not device)
    • Consider cluster randomization by region if network effects expected
  5. Duration:
    • Minimum 2 weeks to capture weekly cycles
    • Consider novelty effects—new icon might get more clicks initially
  6. Risks:
    • Strong brand association with heart—user backlash
    • Novelty effect—temporary lift from curiosity
    • Should run sentiment analysis alongside quantitative metrics
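
A rough check of the sample-size figure in step 3, using the same statsmodels helpers as earlier (baseline and MDE as assumed above):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline like rate 5%, MDE = 2% relative -> detect 5.0% vs 5.1%
effect_size = proportion_effectsize(0.05, 0.051)
n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print(f"Users per variant: {n:,.0f}")  # ~750K per variant, ~1.5M total
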
Challenge 2: Interpret Conflicting Results

Scenario: Your A/B test shows:

  • Primary metric (conversion): +3% (p=0.03) ✅
  • Guardrail metric (page load time): +200ms (p=0.001) ❌

What do you recommend?

Solution:

  1. Acknowledge the trade-off: We have a conversion win but a performance regression.
  2. Quantify the trade-off:
    • What's the revenue impact of +3% conversion?
    • What's the long-term cost of +200ms load time? (Research shows ~1% bounce per 100ms; a rough model is sketched below)
  3. Investigate root cause:
    • Is the load time increase inherent to the feature, or a fixable implementation issue?
    • Segment by device/connection—is it only affecting slow connections?
  4. Recommendation:
    • If load time is fixable: Hold launch, fix performance, re-test
    • If inherent trade-off: Model long-term impact; usually don't ship perf regressions

Key insight: Never ignore guardrail metrics. Short-term wins often become long-term losses.

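To make step 2 of the solution concrete, here is a crude back-of-envelope model; every input (traffic volume, order value, bounce sensitivity) is a hypothetical assumption, not data from the experiment:

# All inputs below are assumptions for illustration only
monthly_sessions = 10_000_000
baseline_conversion = 0.05
avg_order_value = 40.0        # assumed dollars per order
bounce_per_100ms = 0.01       # assumed ~1% extra bounce per +100ms of load time

# Upside: +3% relative lift in conversion
upside = monthly_sessions * baseline_conversion * 0.03 * avg_order_value

# Downside: +200ms -> ~2% of sessions bounce before they can convert at the baseline rate
downside = monthly_sessions * bounce_per_100ms * 2 * baseline_conversion * avg_order_value

print(f"Monthly upside:   ${upside:,.0f}")    # $600,000
print(f"Monthly downside: ${downside:,.0f}")  # $400,000, and the bounce cost compounds over time
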
Challenge 3: Low Power, What Now?

Scenario: Your sample size calculation shows you need 2M users per variant, but you only get 500K users/week. The experiment would take 2 months. What are your options?

Solution Options:

| Option                     | Trade-off                          | When to Use                                      |
|----------------------------|------------------------------------|--------------------------------------------------|
| Increase MDE               | Can only detect larger effects     | If smaller effects aren't worth shipping anyway  |
| Reduce power to 0.70       | 30% chance of missing real effect  | Low-stakes decisions                             |
| Use one-tailed test        | Can't detect negative effects      | Only if you'd never ship a negative result       |
| Variance reduction (CUPED) | Requires pre-experiment data       | Best option if feasible                          |
| Target high-impact segment | Results may not generalize         | Power users, specific geo                        |
| Just wait (run longer)     | Delays product roadmap             | High-stakes decisions                            |

CUPED (Controlled-experiment Using Pre-Experiment Data):

import numpy as np

def cuped_adjust(y, x):
    """CUPED: Y_adj = Y - θ·(X - X̄), where X is the pre-experiment value
    of the metric and θ = Cov(Y, X) / Var(X). Expects numpy arrays."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Var(Y_adj) = Var(Y)·(1 - corr(Y, X)²), so a strong correlation with
# pre-experiment behavior can cut the required sample size by 50%+ in some cases.

Challenge 4: Sample Ratio Mismatch (SRM)

Scenario: Your 50/50 experiment shows 1,020,000 users in control and 980,000 in treatment. Is this a problem?

Solution:

from scipy.stats import chisquare

observed = [1_020_000, 980_000]
expected = [1_000_000, 1_000_000]

# Chi-squared goodness-of-fit test for SRM against the expected 50/50 split
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)  # chi2 = 800, p-value ≈ 0

# For df=1, chi2 > 3.84 is significant at α=0.05
# This is HIGHLY significant - something is wrong!

Common causes of SRM:

  • Bot filtering applied differently
  • Redirect/loading issues in one variant
  • Bucketing bug in the experiment system
  • Treatment causing more users to log out (lose tracking)

Action: DO NOT interpret results until SRM is resolved. Investigate root cause first.

Challenge 5: When NOT to A/B Test

Question: Give at least five scenarios where A/B testing is NOT appropriate.

Solution:

  1. Obvious improvements: Fixing a bug, improving load time. Just ship it.
  2. Legal/compliance changes: GDPR requirements. No choice but to comply.
  3. Low traffic: Would take years to reach significance. Use qualitative research.
  4. Network effects dominate: Marketplace features where treatment affects control through shared inventory.
  5. Long-term effects matter most: Education/habit-forming features. Effect emerges over months, not weeks.
  6. Ethical concerns: Testing features that could harm users (e.g., addiction-promoting).
  7. Launch-and-iterate is cheaper: Low-risk UI changes with easy rollback.

Alternative methods:

  • User research & qualitative testing
  • Quasi-experimental designs (diff-in-diff, regression discontinuity; a minimal example is sketched below)
  • Holdout/long-term experiments
  • Synthetic control methods
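
To give a flavor of the quasi-experimental route, here is a minimal difference-in-differences sketch with hypothetical pre/post conversion rates for one treated region and one comparison region:

# Hypothetical pre/post conversion rates; validity rests on the parallel-trends assumption
pre_treated, post_treated = 0.050, 0.058
pre_control, post_control = 0.050, 0.053

# DiD estimate: the change in the treated region minus the change in the comparison region
did = (post_treated - pre_treated) - (post_control - pre_control)
print(f"Estimated treatment effect: {did:.3f}")  # 0.005 (0.5 percentage points)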

Interview Cheat Sheet

| Question Type                     | What They're Testing | Key Points to Hit                                                             |
|-----------------------------------|----------------------|-------------------------------------------------------------------------------|
| "Design an A/B test for..."       | Structured thinking  | Hypothesis → Metrics → Sample size → Randomization → Duration → Risks         |
| "Results are significant but..."  | Critical thinking    | Statistical vs practical significance, trade-offs, segment analysis           |
| "What could go wrong?"            | Experience           | SRM, novelty, spillover, multiple testing, Simpson's paradox                  |
| "Results are flat, what now?"     | Pragmatism           | Check power, segment, don't ship (absence of evidence ≠ evidence of absence)  |

💬 Discussion Prompts

  1. "What's your process for choosing MDE?" — Share how you balance business needs with statistical feasibility.
  2. "Describe a time an experiment surprised you" — Post your war stories for others to learn from.
  3. "How do you handle stakeholders who want to peek?" — Share strategies for educating non-technical partners.

✅ Self-Assessment

Before moving on, confirm you can:

  • ☐ Calculate sample size using the formula AND explain the intuition
  • ☐ Design an experiment with hypothesis, metrics, and randomization plan
  • ☐ Explain Type I and Type II errors to a non-technical PM
  • ☐ Identify at least 5 common A/B testing pitfalls
  • ☐ Recommend when NOT to run an A/B test