A/B Testing & Experimentation

Design, analyze, and interpret experiments like a senior data scientist.

Why This Matters

A/B testing is the backbone of data-driven product development. At companies like Meta, Google, Netflix, and Amazon, every major product change goes through experimentation. Expect 30%+ of your DS interview to focus on experiment design, analysis, and interpretation.

What interviewers are looking for:

  • Can you design a valid experiment from scratch?
  • Do you understand when experiments can go wrong?
  • Can you interpret results correctly, including edge cases?
  • Do you know when NOT to run an A/B test?

The A/B Testing Mental Model

┌─────────────────────────────────────────────────────────────────┐
│  1. DEFINE THE QUESTION                                         │
│     → What are we trying to learn?                              │
│     → What decision will we make based on results?              │
├─────────────────────────────────────────────────────────────────┤
│  2. DESIGN THE EXPERIMENT                                       │
│     → Hypothesis (H₀ and H₁)                                    │
│     → Metrics (primary, secondary, guardrail)                   │
│     → Sample size & duration                                    │
│     → Randomization unit                                        │
├─────────────────────────────────────────────────────────────────┤
│  3. RUN THE EXPERIMENT                                          │
│     → Validate randomization (A/A check)                        │
│     → Monitor for bugs & anomalies                              │
│     → DON'T peek and make decisions early!                      │
├─────────────────────────────────────────────────────────────────┤
│  4. ANALYZE RESULTS                                             │
│     → Statistical significance                                  │
│     → Practical significance (effect size)                      │
│     → Segment analysis (but beware multiple testing)            │
├─────────────────────────────────────────────────────────────────┤
│  5. MAKE A DECISION                                             │
│     → Ship, iterate, or kill                                    │
│     → Document learnings                                        │
└─────────────────────────────────────────────────────────────────┘
    

Key Concepts You Must Know

1. Hypothesis Testing Fundamentals

  • Null Hypothesis (H₀): There is no difference between control and treatment
  • Alternative Hypothesis (H₁): There is a difference
  • p-value: Probability of seeing data this extreme IF H₀ is true
  • Significance level (α): Threshold for rejecting H₀ (typically 0.05)
  • Power (1-β): Probability of detecting a real effect (typically 0.80)

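To see these definitions in action, here is a minimal sketch of a two-sided two-proportion z-test using statsmodels; the conversion counts are made-up numbers for illustration:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: 520/10,000 conversions in control, 585/10,000 in treatment
conversions = np.array([520, 585])
samples = np.array([10_000, 10_000])

# H0: the two conversion rates are equal; H1: they differ (two-tailed)
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples, alternative='two-sided')
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # p ≈ 0.04 < α = 0.05, so reject H0
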
2. Error Types

|                   | H₀ True (No Effect)                | H₀ False (Real Effect)              |
|-------------------|------------------------------------|-------------------------------------|
| Reject H₀         | Type I Error (α) – False Positive  | ✅ Correct – True Positive          |
| Fail to Reject H₀ | ✅ Correct – True Negative         | Type II Error (β) – False Negative  |

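A quick simulation sketch ties this table to α and power: in an A/A comparison (H₀ true) the test should reject about 5% of the time, and with the ~31,000-per-group sample size computed in the next subsection it should detect a real 5.0% → 5.5% lift about 80% of the time. All data here are simulated:

import numpy as np

rng = np.random.default_rng(1)
n, n_sims, z_crit = 31_000, 1_000, 1.96  # n per group, from the sample-size example below

def reject_rate(p_control, p_treatment):
    """Share of simulated experiments where a pooled two-proportion z-test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        c = rng.binomial(n, p_control) / n
        t = rng.binomial(n, p_treatment) / n
        p_pool = (c + t) / 2
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        rejections += abs(t - c) / se > z_crit
    return rejections / n_sims

print("Type I error rate (A/A, no real effect):", reject_rate(0.05, 0.05))   # ≈ 0.05 = α
print("Power (real lift from 5.0% to 5.5%):", reject_rate(0.05, 0.055))      # ≈ 0.80 = 1 - β
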
3. Sample Size Formula (Proportions)

For a two-proportion z-test with equal group sizes:

n = 2 × [(Z_α/2 + Z_β)² × p̄(1-p̄)] / (p₁ - p₂)²

where:
  p̄ = (p₁ + p₂) / 2  (pooled proportion)
  p₁ = baseline conversion rate
  p₂ = expected treatment conversion rate
  Z_α/2 = 1.96 for α=0.05 (two-tailed)
  Z_β = 0.84 for power=0.80

Quick Python implementation:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Example: baseline 5%, want to detect 5.5% (10% relative lift)
effect_size = proportion_effectsize(0.05, 0.055)
analysis = NormalIndPower()
n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided')
print(f"Sample size per group: {n:.0f}")  # ~31,000

Metrics Framework

Every experiment needs three types of metrics:

| Metric Type          | Purpose                                    | Example                                  |
|----------------------|--------------------------------------------|------------------------------------------|
| Primary (North Star) | The ONE metric that determines success     | Conversion rate, DAU, Revenue/user       |
| Secondary            | Additional insights, explain mechanisms    | Click-through rate, time on page         |
| Guardrail            | Ensure we don't break something important  | Page load time, error rate, unsubscribes |

Example: Testing a new checkout flow

  • Primary: Purchase completion rate
  • Secondary: Average order value, cart abandonment rate
  • Guardrail: Page load time, support tickets, refund rate

Common Pitfalls & How to Avoid Them

1. Peeking Problem

What: Checking results multiple times and stopping when significant.

Why it's bad: Inflates false positive rate. With daily peeking for 20 days at α=0.05, your actual false positive rate can exceed 25%!

Fix: Pre-commit to a sample size and analysis date. If you must peek, use sequential testing (e.g., alpha spending functions).

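A small Monte Carlo sketch shows why this matters: both groups below are drawn from the same 5% conversion rate (a pure A/A test), yet stopping at the first "significant" daily peek declares a winner far more often than the nominal 5%. All numbers are illustrative:

import numpy as np

rng = np.random.default_rng(42)
n_sims, n_days, users_per_day, z_crit = 1_000, 20, 500, 1.96

false_positives = 0
for _ in range(n_sims):
    # A/A test: no real effect exists in either group
    a = rng.binomial(1, 0.05, size=n_days * users_per_day)
    b = rng.binomial(1, 0.05, size=n_days * users_per_day)
    for day in range(1, n_days + 1):
        n = day * users_per_day
        p_a, p_b = a[:n].mean(), b[:n].mean()
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(p_a - p_b) / se > z_crit:
            false_positives += 1  # stopped early and declared a (false) winner
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")  # typically ~20-30%
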
2. Multiple Testing Problem

What: Testing 20 metrics and declaring victory on the one that's significant.

Why it's bad: At α=0.05, testing 20 metrics gives ~64% chance of at least one false positive.

Fix: Pre-specify ONE primary metric. Apply Bonferroni (α/n) or Benjamini-Hochberg (FDR) corrections for secondaries.

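A minimal sketch of applying these corrections with statsmodels, using simulated p-values for 20 secondary metrics that have no real effect (so any "win" is a false positive):

import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Under the null, p-values are uniform on [0, 1]
p_values = rng.uniform(0, 1, size=20)

print("Uncorrected 'wins' at α=0.05:", (p_values < 0.05).sum())

# Bonferroni controls the family-wise error rate; Benjamini-Hochberg controls the FDR
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print("Bonferroni wins:", reject_bonf.sum(), "| Benjamini-Hochberg wins:", reject_bh.sum())
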
3. Network Effects / Spillover

What: Treatment users affect control users through social connections.

Why it's bad: Dilutes treatment effect; biases toward null.

Fix: Cluster randomization (randomize by geography, community, or time).

4. Novelty / Primacy Effects

What: New features show temporary lift from curiosity, or underperform while users adapt.

Fix: Run experiments for 2+ weeks; segment by new vs returning users; look at effect over time.

5. Simpson's Paradox

What: Treatment wins overall but loses in every segment (or vice versa).

Why it happens: Unequal segment sizes between variants.

Fix: Always check segment-level results; investigate sample ratio mismatch (SRM).

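A tiny numeric illustration with hypothetical counts: the treatment wins in both segments but loses overall, because its traffic skews toward the low-converting segment:

# (conversions, users) per segment and variant -- numbers chosen purely to show the effect
data = {
    "new users":       {"control": (10, 100),  "treatment": (60, 500)},
    "returning users": {"control": (250, 500), "treatment": (55, 100)},
}

for seg, groups in data.items():
    rates = {g: conv / n for g, (conv, n) in groups.items()}
    print(seg, {g: f"{r:.1%}" for g, r in rates.items()})  # treatment wins in BOTH segments

for g in ("control", "treatment"):
    conv = sum(data[seg][g][0] for seg in data)
    n = sum(data[seg][g][1] for seg in data)
    print(g, f"overall: {conv / n:.1%}")  # ...yet treatment loses overall
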
🧠 Challenge Questions with Solutions

Challenge 1: Design an A/B Test

Scenario: Instagram is considering changing the "Like" button from a heart to a thumbs-up. Design the experiment.

Solution Framework:

  1. Hypothesis:
    • H₀: Thumbs-up will not change like rate
    • H₁: Thumbs-up will change like rate (two-tailed since we're unsure of direction)
  2. Metrics:
    • Primary: Like rate (likes / impressions)
    • Secondary: Engagement rate, time spent, content creation rate
    • Guardrail: DAU, session duration, negative feedback rate
  3. Sample Size:
    • Assume baseline like rate = 5%
    • MDE = 2% relative (detect 5% → 5.1%)
    • α = 0.05, power = 0.80
    • Result: ~750K users per variant, ~1.5M total (see the calculation sketched after this list)
  4. Randomization:
    • Unit: User ID (not session, not device)
    • Consider cluster randomization by region if network effects expected
  5. Duration:
    • Minimum 2 weeks to capture weekly cycles
    • Consider novelty effects—new icon might get more clicks initially
  6. Risks:
    • Strong brand association with heart—user backlash
    • Novelty effect—temporary lift from curiosity
    • Should run sentiment analysis alongside quantitative metrics
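
A rough check of the sample-size figure in step 3, using the same statsmodels helpers as earlier (baseline and MDE as assumed above):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline like rate 5%, MDE = 2% relative -> detect 5.0% vs 5.1%
effect_size = proportion_effectsize(0.05, 0.051)
n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print(f"Users per variant: {n:,.0f}")  # ~750K per variant, ~1.5M total
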
Challenge 2: Interpret Conflicting Results

Scenario: Your A/B test shows:

  • Primary metric (conversion): +3% (p=0.03) ✅
  • Guardrail metric (page load time): +200ms (p=0.001) ❌

What do you recommend?

Solution:

  1. Acknowledge the trade-off: We have a conversion win but a performance regression.
  2. Quantify the trade-off:
    • What's the revenue impact of +3% conversion?
    • What's the long-term cost of +200ms load time? (Research shows ~1% bounce per 100ms; a rough model is sketched below)
  3. Investigate root cause:
    • Is the load time increase inherent to the feature, or a fixable implementation issue?
    • Segment by device/connection—is it only affecting slow connections?
  4. Recommendation:
    • If load time is fixable: Hold launch, fix performance, re-test
    • If inherent trade-off: Model long-term impact; usually don't ship perf regressions

Key insight: Never ignore guardrail metrics. Short-term wins often become long-term losses.

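To make step 2 of the solution concrete, here is a crude back-of-envelope model; every input (traffic volume, order value, bounce sensitivity) is a hypothetical assumption, not data from the experiment:

# All inputs below are assumptions for illustration only
monthly_sessions = 10_000_000
baseline_conversion = 0.05
avg_order_value = 40.0        # assumed dollars per order
bounce_per_100ms = 0.01       # assumed ~1% extra bounce per +100ms of load time

# Upside: +3% relative lift in conversion
upside = monthly_sessions * baseline_conversion * 0.03 * avg_order_value

# Downside: +200ms -> ~2% of sessions bounce before they can convert at the baseline rate
downside = monthly_sessions * bounce_per_100ms * 2 * baseline_conversion * avg_order_value

print(f"Monthly upside:   ${upside:,.0f}")    # $600,000
print(f"Monthly downside: ${downside:,.0f}")  # $400,000, and the bounce cost compounds over time
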
Challenge 3: Low Power, What Now?

Scenario: Your sample size calculation shows you need 2M users per variant, but you only get 500K users/week. The experiment would take 2 months. What are your options?

Solution Options:

| Option                     | Trade-off                          | When to Use                                      |
|----------------------------|------------------------------------|--------------------------------------------------|
| Increase MDE               | Can only detect larger effects     | If smaller effects aren't worth shipping anyway  |
| Reduce power to 0.70       | 30% chance of missing real effect  | Low-stakes decisions                             |
| Use one-tailed test        | Can't detect negative effects      | Only if you'd never ship a negative result       |
| Variance reduction (CUPED) | Requires pre-experiment data       | Best option if feasible                          |
| Target high-impact segment | Results may not generalize         | Power users, specific geo                        |
| Just wait (run longer)     | Delays product roadmap             | High-stakes decisions                            |

CUPED (Controlled-experiment Using Pre-Experiment Data):

import numpy as np

def cuped_adjust(y, x):
    """CUPED: Y_adj = Y - θ·(X - X̄), where X is the pre-experiment value
    of the metric and θ = Cov(Y, X) / Var(X). Expects numpy arrays."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Var(Y_adj) = Var(Y)·(1 - corr(Y, X)²), so a strong correlation with
# pre-experiment behavior can cut the required sample size by 50%+ in some cases.

Challenge 4: Sample Ratio Mismatch (SRM)

Scenario: Your 50/50 experiment shows 1,020,000 users in control and 980,000 in treatment. Is this a problem?

Solution:

from scipy.stats import chisquare

observed = [1_020_000, 980_000]
expected = [1_000_000, 1_000_000]

# Chi-squared goodness-of-fit test for SRM against the expected 50/50 split
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)  # chi2 = 800, p-value ≈ 0

# For df=1, chi2 > 3.84 is significant at α=0.05
# This is HIGHLY significant - something is wrong!

Common causes of SRM:

  • Bot filtering applied differently
  • Redirect/loading issues in one variant
  • Bucketing bug in the experiment system
  • Treatment causing more users to log out (lose tracking)

Action: DO NOT interpret results until SRM is resolved. Investigate root cause first.

Challenge 5: When NOT to A/B Test

Question: Give at least five scenarios where A/B testing is NOT appropriate.

Solution:

  1. Obvious improvements: Fixing a bug, improving load time. Just ship it.
  2. Legal/compliance changes: GDPR requirements. No choice but to comply.
  3. Low traffic: Would take years to reach significance. Use qualitative research.
  4. Network effects dominate: Marketplace features where treatment affects control through shared inventory.
  5. Long-term effects matter most: Education/habit-forming features. Effect emerges over months, not weeks.
  6. Ethical concerns: Testing features that could harm users (e.g., addiction-promoting).
  7. Launch-and-iterate is cheaper: Low-risk UI changes with easy rollback.

Alternative methods:

  • User research & qualitative testing
  • Quasi-experimental designs (diff-in-diff, regression discontinuity; a minimal example is sketched below)
  • Holdout/long-term experiments
  • Synthetic control methods
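
To give a flavor of the quasi-experimental route, here is a minimal difference-in-differences sketch with hypothetical pre/post conversion rates for one treated region and one comparison region:

# Hypothetical pre/post conversion rates; validity rests on the parallel-trends assumption
pre_treated, post_treated = 0.050, 0.058
pre_control, post_control = 0.050, 0.053

# DiD estimate: the change in the treated region minus the change in the comparison region
did = (post_treated - pre_treated) - (post_control - pre_control)
print(f"Estimated treatment effect: {did:.3f}")  # 0.005 (0.5 percentage points)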

Interview Cheat Sheet

| Question Type                     | What They're Testing | Key Points to Hit                                                             |
|-----------------------------------|----------------------|-------------------------------------------------------------------------------|
| "Design an A/B test for..."       | Structured thinking  | Hypothesis → Metrics → Sample size → Randomization → Duration → Risks         |
| "Results are significant but..."  | Critical thinking    | Statistical vs practical significance, trade-offs, segment analysis           |
| "What could go wrong?"            | Experience           | SRM, novelty, spillover, multiple testing, Simpson's paradox                  |
| "Results are flat, what now?"     | Pragmatism           | Check power, segment, don't ship (absence of evidence ≠ evidence of absence)  |

💬 Discussion Prompts

  1. "What's your process for choosing MDE?" — Share how you balance business needs with statistical feasibility.
  2. "Describe a time an experiment surprised you" — Post your war stories for others to learn from.
  3. "How do you handle stakeholders who want to peek?" — Share strategies for educating non-technical partners.

✅ Self-Assessment

Before moving on, confirm you can:

  • ☐ Calculate sample size using the formula AND explain the intuition
  • ☐ Design an experiment with hypothesis, metrics, and randomization plan
  • ☐ Explain Type I and Type II errors to a non-technical PM
  • ☐ Identify at least 5 common A/B testing pitfalls
  • ☐ Recommend when NOT to run an A/B test