Real-World Data Distributions in Product Analytics
Understanding data distributions is crucial for choosing the right statistical tests, designing experiments, and interpreting results correctly. This guide covers common distributions you’ll encounter in product analytics and data science interviews.
Why Distributions Matter
- Statistical Tests: Many tests assume normality; knowing your distribution helps select appropriate methods
- A/B Testing: Sample size calculations and result interpretation depend on data distribution
- Outlier Detection: What counts as an outlier depends on the expected distribution
- Metric Design: Choosing mean vs. median depends on distribution shape
Common Distributions in Product Data
1. Normal (Gaussian) Distribution
Where You’ll See It:
- Aggregated metrics over large samples (Central Limit Theorem)
- Measurement errors
- User session durations (after log transformation)
- Physical measurements (height, weight)
Key Properties:
- Symmetric, bell-shaped
- Mean = Median = Mode
- 68-95-99.7 rule: 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ
When to Assume Normality:
- Large sample sizes (rule of thumb: n > 30) when working with sample means
- After log transformation of right-skewed data
- Residuals in regression analysis
import numpy as np
from scipy import stats
# Generate normal data
data = np.random.normal(loc=100, scale=15, size=1000)
# Check for normality (a large p-value is consistent with normality)
statistic, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk test p-value: {p_value:.4f}")
2. Log-Normal Distribution
Where You’ll See It:
- Revenue per user (many small, few large transactions)
- Time between purchases
- User engagement scores
- Session duration
- File sizes
Why It Matters: Many “long-tail” metrics in product analytics are log-normal. Taking the log makes them approximately normal, enabling standard statistical tests.
Key Properties:
- Right-skewed (long tail on the right)
- Always positive values
- Multiplicative processes (compounding effects)
Practical Tip:
import numpy as np
# Transform log-normal to normal
revenue = [10, 15, 20, 25, 100, 500, 1000] # Skewed
log_revenue = np.log(revenue) # Now approximately normal
# Use geometric mean instead of arithmetic mean
geometric_mean = np.exp(np.mean(log_revenue))
3. Poisson Distribution
Where You’ll See It:
- Number of app opens per day per user
- Customer support tickets per hour
- Clicks on a page
- Events in a time window
- Number of errors per session
Key Properties:
- Discrete counts (0, 1, 2, 3, …)
- Mean = Variance = λ
- Useful for rare events
Interview Application: When asked about modeling counts or rates, consider Poisson. Check whether mean ≈ variance to validate; a quick check follows the code below.
from scipy import stats
# Model daily app opens
lambda_param = 3 # Average opens per day
poisson = stats.poisson(lambda_param)
# Probability of exactly 5 opens
prob_5 = poisson.pmf(5)
print(f"P(X=5): {prob_5:.4f}")
4. Exponential Distribution
Where You’ll See It:
- Time between user actions
- Session inter-arrival times
- Time to first conversion
- Wait time until next event
Key Properties:
- Continuous, always positive
- Memoryless property (the time already waited doesn't change the distribution of the remaining wait)
- Often paired with Poisson (time between Poisson events)
Practical Use: Modeling time-to-event metrics like time to churn or time to first purchase.
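A minimal sketch with scipy, assuming a made-up average of 2 hours between user actions:
from scipy import stats
# Exponential with mean wait of 2 hours (scale parameter = mean = 1/rate)
expon = stats.expon(scale=2.0)
# Probability the next action happens within 1 hour
print(f"P(T <= 1): {expon.cdf(1):.4f}")
# Memorylessness: having already waited 2 hours, the chance of waiting
# at least 1 more hour equals the unconditional P(T > 1)
print(f"P(T > 3 | T > 2): {expon.sf(3) / expon.sf(2):.4f}")
print(f"P(T > 1): {expon.sf(1):.4f}")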
5. Power Law (Pareto/Zipf) Distribution
Where You’ll See It:
- User engagement (few users generate most content)
- Revenue concentration (few customers drive most revenue)
- Social connections (few users have most followers)
- Product popularity (few products get most sales)
- Word frequencies in text
Key Properties:
- 80/20 rule (Pareto principle)
- The mean can be unstable or even infinite for heavy enough tails, so averages are often not meaningful
- Log-log plot appears linear
Why This Matters for Interviews: Understanding power laws helps explain:
- Why averages can be misleading for engagement metrics (see the sketch after this list)
- Why personalization focuses on “whale” users
- Why A/B tests on engagement can be tricky
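A small sketch of the mean/median gap, using synthetic Pareto draws (the tail exponent of 1.2 is an arbitrary illustration):
import numpy as np
rng = np.random.default_rng(42)
# Classical Pareto with minimum value 1 and tail exponent 1.2
samples = rng.pareto(1.2, size=100_000) + 1
print(f"mean:   {samples.mean():.2f}")
print(f"median: {np.median(samples):.2f}")
# The mean lands several times above the median because a few
# extreme values dominate the sum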
6. Binomial Distribution
Where You’ll See It:
- Conversion rates (success/failure outcomes)
- Click-through rates
- A/B test success counts
- Any yes/no outcome across n trials
Key Properties:
- n independent trials, each with probability p of success
- Mean = np, Variance = np(1-p)
- Approximately normal for large n (common rule of thumb: np and n(1-p) both at least 5)
A/B Testing Application:
from scipy import stats
# Conversion test: 1000 users, 5% conversion
n, p = 1000, 0.05
binom = stats.binom(n, p)
# Central 95% interval for the number of conversions, given p
lower, upper = binom.ppf(0.025), binom.ppf(0.975)
print(f"Central 95% interval: [{lower:.0f}, {upper:.0f}] conversions")
Practical Implications for A/B Testing
Non-Normal Data Handling
| Data Type | Common Distribution | Recommended Approach |
|---|---|---|
| Revenue | Log-normal | Log-transform, then t-test |
| Counts | Poisson | Poisson regression or bootstrap |
| Proportions | Binomial | Chi-squared or Z-test for proportions |
| Time-to-event | Exponential/Weibull | Survival analysis |
| Heavy-tailed | Power law | Bootstrap, use medians |
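As a concrete sketch of the table's first row, with synthetic log-normal revenue for two groups (all parameters are illustrative):
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
control = rng.lognormal(mean=3.0, sigma=1.0, size=500)
treatment = rng.lognormal(mean=3.1, sigma=1.0, size=500)
# Log-transform, then run a standard t-test on the roughly normal values
t_stat, p_value = stats.ttest_ind(np.log(control), np.log(treatment))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")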
When to Transform Data
- Log Transform: Right-skewed data (revenue, time metrics)
- Square Root: Count data with Poisson-like properties
- Box-Cox: General normalizing transformation
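A minimal sketch of all three on a made-up skewed sample (Box-Cox requires strictly positive values):
import numpy as np
from scipy import stats
skewed = np.array([1.0, 2.0, 2.5, 3.0, 10.0, 50.0, 200.0])
log_t = np.log(skewed)                # for right-skewed data
sqrt_t = np.sqrt(skewed)              # for Poisson-like counts
boxcox_t, lam = stats.boxcox(skewed)  # fits the lambda parameter automatically
print(f"Box-Cox lambda: {lam:.3f}")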
Bootstrap for Complex Metrics
When distribution is unclear or complex, use bootstrapping:
import numpy as np
def bootstrap_mean_ci(data, n_bootstrap=10000, ci=0.95):
"""Calculate bootstrap confidence interval for mean."""
bootstrap_mean_samples = [np.mean(np.random.choice(data, len(data), replace=True))
for _ in range(n_bootstrap)]
lower = np.percentile(bootstrap_mean_samples, (1-ci)/2 * 100)
upper = np.percentile(bootstrap_mean_samples, (1+ci)/2 * 100)
return lower, upper
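For example, on the skewed revenue values from earlier (illustrative numbers):
revenue = np.array([10, 15, 20, 25, 100, 500, 1000])
lower, upper = bootstrap_mean_ci(revenue)
print(f"95% bootstrap CI for the mean: [{lower:.1f}, {upper:.1f}]")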
Distribution Selection Decision Tree
Note: This is a simplified decision tree for initial guidance. Always complement these rules with domain knowledge, visual inspection of your data (histograms, Q-Q plots), and formal statistical tests for distribution fitting (e.g., Shapiro–Wilk test for normality, Kolmogorov–Smirnov test, Anderson–Darling test); a short sketch of such tests follows the tree.
Is your data continuous?
├── YES: Is it always positive?
│ ├── YES: Is it right-skewed?
│ │ ├── YES → Consider Log-Normal or Exponential
│ │ └── NO → Consider Normal
│ └── NO: Can it be negative?
│ └── YES → Consider Normal
└── NO (Discrete): Is it a count?
├── YES: Is mean ≈ variance?
│ ├── YES → Consider Poisson
│ └── NO → Consider Negative Binomial
└── NO: Is it binary (yes/no)?
└── YES → Consider Binomial/Bernoulli
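To back the tree with the formal tests mentioned in the note above, a brief sketch on synthetic data:
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)
# KS test against a normal fitted to the data (approximate: the standard
# KS p-value assumes parameters were not estimated from the same sample)
ks_stat, ks_p = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(f"KS vs normal: p = {ks_p:.4f}")  # tiny p-value: reject normality
# The same data looks normal after a log transform
log_data = np.log(data)
ks_stat2, ks_p2 = stats.kstest(log_data, 'norm',
                               args=(log_data.mean(), log_data.std(ddof=1)))
print(f"KS vs normal after log: p = {ks_p2:.4f}")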
Summary Table: Distributions by Domain
| Domain | Metric Example | Likely Distribution | Key Consideration |
|---|---|---|---|
| E-commerce | Revenue per user | Log-normal | Use geometric mean |
| Social Media | Followers count | Power law | Median more meaningful |
| SaaS | Daily active usage | Poisson | Check mean=variance |
| Fintech | Transaction amount | Log-normal | Heavy tail risk |
| Gaming | Session duration | Log-normal | Consider censoring |
| Healthcare | Test results | Normal (often) | Validate assumptions |
Interview Tips
- Always ask about the distribution before recommending a statistical test
- Know when mean is misleading (power law, heavily skewed data)
- Understand the CLT: sample means become approximately normal even if the population isn't
- Be ready to suggest transformations for non-normal data
- Bootstrapping is your friend when unsure about distribution