Statistics & Probability

Your Guide To Statistical Mastery for Data Science

Overview

This section covers the fundamental concepts and skills required for a Data Science (Analytical) role at Meta. At Meta, Data Scientists play a crucial role in driving product development and business strategy through rigorous data analysis and statistical reasoning. This role is heavily focused on using statistical methods to understand user behavior, measure the impact of product changes, and inform data-driven decisions across Meta's vast ecosystem of products (Facebook, Instagram, WhatsApp, etc.).

Meta works with billions of users and petabytes of data, so statistical rigor is paramount. Data Scientists at Meta are expected to design and analyze A/B tests to evaluate the impact of product changes, develop metrics and KPIs to track product performance and user engagement, build statistical models to predict user behavior and identify opportunities for improvement, and communicate complex statistical findings to both technical and non-technical audiences. A strong foundation in statistics and probability is therefore essential.

What Can You Expect?

You can expect questions that not only test your knowledge of statistical concepts but also your ability to apply them to real-world product scenarios. Interviewers will be looking for your understanding of how to use data to answer business questions and drive product improvements. Expect questions on:

  • Descriptive statistics (mean, median, mode, variance, standard deviation): These form the basis for understanding data distributions and identifying key trends. Be prepared to calculate these metrics and explain their significance in a business context.
  • Probability distributions (normal, binomial, Poisson, exponential): Understanding these distributions is crucial for modeling various phenomena, such as user activity, event occurrences, and time-to-event analyses.
  • Hypothesis testing (A/B testing, t-tests, p-values, confidence intervals, statistical power): A/B testing is a cornerstone of product development at Meta. Be prepared to design A/B tests, calculate sample sizes, interpret p-values and confidence intervals, and understand the concept of statistical power.
  • Regression analysis (linear, logistic): Regression models are used to understand relationships between variables and predict outcomes. You should be comfortable with both linear and logistic regression and be able to interpret model coefficients and evaluate model performance.
  • Experimental design: Designing sound experiments is crucial for drawing valid conclusions from data. You should understand the principles of randomization, control groups, and how to minimize bias.
  • Bayes' theorem: Bayes' theorem is used to update probabilities based on new evidence. It's particularly relevant for problems involving classification, filtering, and prediction.

How to Prep

1. Descriptive Statistics

Explanation: Descriptive statistics summarize and describe the main features of a dataset. They provide a snapshot of the data's central tendency (where the data is centered) and dispersion (how spread out the data is). Key measures include:

  • Mean: The average value (sum of all values divided by the number of values). Formula: μ = Σx / n
  • Median: The middle value when the data is ordered. If there's an even number of values, the median is the average of the two middle values.
  • Mode: The most frequent value. A dataset can have multiple modes or no mode at all.
  • Variance: The average of the squared differences from the mean. Population formula: σ² = Σ(x - μ)² / n. When estimating from a sample, divide by n - 1 instead: s² = Σ(x - x̄)² / (n - 1)
  • Standard Deviation: The square root of the variance, representing the typical deviation from the mean. Formula: σ = √σ²

These measures are crucial for understanding data distributions and identifying patterns or anomalies. For instance, comparing the mean and median can reveal skewness in the data. Standard deviation helps quantify the data's volatility or spread.

Wikipedia: Descriptive statistics

Practice Questions:

  1. You have website session durations (in seconds): 10, 15, 20, 20, 25, 30, 60. Calculate the mean, median, mode, variance, and standard deviation.
    • Mean: (10+15+20+20+25+30+60)/7 = 25.71
    • Median: 20
    • Mode: 20
    • Variance: Calculate the squared differences from the mean, sum them (~1621.43), and divide by 7. Result ~231.63 (population variance; dividing by n - 1 = 6 instead gives a sample variance of ~270.24)
    • Standard Deviation: √231.63 ~ 15.22
  2. A product has daily active users (DAU) for a week: 1000, 1200, 1100, 1300, 1050, 950, 1150. Calculate the average DAU and the standard deviation. What does the standard deviation tell you about the DAU?
    • Average DAU: 1107.14
    • Standard Deviation: ~111.6 (population formula, dividing by n; ~120.5 with the n - 1 sample formula)
    • The standard deviation tells us about the variability or spread of the DAU around the average. A higher standard deviation indicates more fluctuation in DAU.
  3. Explain how outliers can affect the mean and median. Provide an example.
    • Outliers significantly affect the mean because the mean takes into account all values. However, the median is less sensitive to outliers as it only considers the middle value(s).
    • Example: Consider the dataset: 1, 2, 3, 4, 100. The mean is 22, while the median is 3. The outlier (100) pulls the mean far above the other values, whereas the median would still be 3 if 100 were replaced by 5.
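
The arithmetic in the practice questions above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the helper name `describe` is an illustrative choice and not part of the original guide. It reports the population variance and standard deviation (dividing by n, as in the formulas above) next to the sample standard deviation (dividing by n - 1).

```python
import statistics


def describe(values):
    """Return summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),       # every most-frequent value
        "variance": statistics.pvariance(values),    # population: divides by n
        "std_dev": statistics.pstdev(values),        # sqrt of population variance
        "sample_std_dev": statistics.stdev(values),  # sample: divides by n - 1
    }


sessions = [10, 15, 20, 20, 25, 30, 60]               # question 1
dau = [1000, 1200, 1100, 1300, 1050, 950, 1150]       # question 2
with_outlier = [1, 2, 3, 4, 100]                      # question 3

for name, data in [("sessions", sessions), ("dau", dau), ("outlier", with_outlier)]:
    summary = describe(data)
    rounded = {k: round(v, 2) if isinstance(v, float) else v for k, v in summary.items()}
    print(name, rounded)
```

For the session data this gives a population variance of about 231.63 and a standard deviation of about 15.22. For the DAU data the population standard deviation is about 111.6, and the sample version is about 120.5.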

2. Probability Distributions

Explanation: Probability distributions describe how values are distributed across a range. Understanding these distributions helps you model real-world phenomena and choose appropriate statistical tests.

Common Distributions:

  • Normal Distribution: Bell-shaped, symmetric. Many natural phenomena follow this distribution.
  • Binomial Distribution: Number of successes in n independent trials with probability p.
  • Poisson Distribution: Number of events occurring in a fixed interval of time/space.
  • Exponential Distribution: Time between events in a Poisson process.

Worked Examples:

  1. Normal Distribution: User session times on a website are normally distributed with mean μ = 5 minutes and standard deviation σ = 1.5 minutes. What's the probability a user spends more than 7 minutes?
    • Z-score = (7 - 5) / 1.5 = 1.33
    • P(Z > 1.33) = 1 - P(Z < 1.33) = 1 - 0.9082 = 0.0918 or 9.18%
  2. Binomial Distribution: In an A/B test, 40% of users in the control group convert. If you have 100 users, what's the probability exactly 35 convert?
    • P(X = 35) = C(100,35) × (0.4)^35 × (0.6)^65
    • This is approximately 0.049 or 4.9%
  3. Poisson Distribution: Customers arrive at a store at an average rate of 3 per hour. What's the probability exactly 5 arrive in the next hour?
    • P(X = 5) = e^(-3) × 3^5 / 5! = 0.0498 × 243 / 120 = 0.1008 or 10.08%
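
The three worked examples above can be reproduced with the standard library alone. The sketch below assumes Python 3.8 or later (for `math.comb` and `statistics.NormalDist`); the function names are illustrative.

```python
import math
from statistics import NormalDist


def normal_tail(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma)."""
    return 1.0 - NormalDist(mu, sigma).cdf(x)


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Example 1: session time above 7 minutes (mean 5, standard deviation 1.5)
print(round(normal_tail(7, 5, 1.5), 4))
# Example 2: exactly 35 conversions out of 100 at a 40% rate
print(round(binomial_pmf(35, 100, 0.4), 4))
# Example 3: exactly 5 arrivals when the average is 3 per hour
print(round(poisson_pmf(5, 3), 4))
```

The normal result comes out near 0.091 rather than exactly 0.0918 because the code uses the unrounded z-score (1.333...) instead of the table value for 1.33. The binomial and Poisson results are about 0.049 and 0.101.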

3. Hypothesis Testing

Explanation: Hypothesis testing helps determine whether observed differences are statistically significant or due to chance. The process involves stating null and alternative hypotheses, choosing a significance level, calculating a test statistic, and making a decision.

Key Concepts:

  • Null Hypothesis (H₀): No difference exists
  • Alternative Hypothesis (H₁): A difference exists
  • p-value: Probability of observing the data (or more extreme) assuming H₀ is true
  • Significance Level (α): Threshold for rejecting H₀ (commonly 0.05)
  • Type I Error: Rejecting H₀ when it's true (false positive)
  • Type II Error: Failing to reject H₀ when it's false (false negative)

Worked Examples:

  1. A/B Test Analysis: You run an A/B test with 10,000 users per variant. Control converts at 4.2%, treatment at 4.8%. A two-sided two-proportion z-test gives a p-value of about 0.04. What do you conclude at α = 0.05?
    • Since p-value (0.04) < α (0.05), reject H₀
    • Conclusion: The difference is statistically significant at the 5% level, with treatment converting better than control
    • Relative lift = (4.8% - 4.2%) / 4.2% = 14.3%
  2. Sample Size Calculation: You want to detect a 5% relative lift (from 10% to 10.5% conversion). What's the required sample size per variant for 80% power and α = 0.05?
    • Baseline conversion p₁ = 0.10
    • Expected conversion p₂ = 0.105
    • Use sample size formula: n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)²
    • Plugging in: (1.96 + 0.84)² × (0.0900 + 0.0940) / (0.005)² ≈ 57,700, so roughly 58,000 users per variant
  3. Confidence Intervals: Your A/B test shows a 2% absolute lift with 95% CI of [1.2%, 2.8%]. How do you interpret this?
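    • The interval [1.2%, 2.8%] is the range of true lifts that are plausible given the data; the procedure that produced it captures the true lift in 95% of repeated experiments
    • Since the CI doesn't include 0, the result is statistically significant at the 5% level
    • Even at the lower bound the lift is 1.2%, so the remaining question is whether a lift that small would still matter in practice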
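
The calculations behind these examples can be sketched in a few lines. The code below is one common approach (a pooled two-proportion z-test, the normal-approximation sample size formula quoted above, and an unpooled Wald interval for the difference); the function names are illustrative, and other test choices would give slightly different numbers.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided pooled z-test for a difference in conversion rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p_value


def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def diff_confidence_interval(p1, n1, p2, n2, level=0.95):
    """Wald interval for p2 - p1 using the unpooled standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se


z, p = two_proportion_z_test(0.042, 10_000, 0.048, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
print("users per variant:", sample_size_per_variant(0.10, 0.105))
print("95% CI for the lift:", diff_confidence_interval(0.042, 10_000, 0.048, 10_000))
```

For the 4.2% versus 4.8% example, the z-statistic is about 2.05 and the two-sided p-value is about 0.04. Detecting a move from 10% to 10.5% needs close to 58,000 users per variant under these assumptions.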

4. Regression Analysis

Explanation: Regression analysis models relationships between variables. Linear regression assumes a linear relationship, while logistic regression is used for binary outcomes.

Linear Regression:

  • Equation: Y = β₀ + β₁X₁ + β₂X₂ + ... + ε
  • Coefficient Interpretation: β₁ represents the expected change in Y for a 1-unit increase in X₁, holding the other predictors constant
  • R-squared: Proportion of variance in Y explained by the model
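
To make the equation and R-squared concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Python. The data points are hypothetical values invented for illustration, and `fit_simple_ols` is an illustrative name; in practice a library such as statsmodels or scikit-learn would normally be used.

```python
def fit_simple_ols(xs, ys):
    """Fit y = b0 + b1 * x by least squares and return (b0, b1, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical data: minutes in the app (x) and an engagement score (y)
minutes = [1, 2, 3, 4, 5]
engagement = [12.1, 13.9, 16.2, 17.8, 20.0]

b0, b1, r2 = fit_simple_ols(minutes, engagement)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, R^2 = {r2:.3f}")
```

With this made-up data the fitted slope is 1.97 and the intercept is 10.09, so each extra minute is associated with roughly 2 more engagement units.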

Worked Examples:

  1. Simple Linear Regression: You model user engagement (Y) as a function of time spent on app (X). The equation is Y = 10 + 2X. Interpret the coefficients.
    • β₀ = 10: Expected engagement when time spent = 0
    • β₁ = 2: Each additional minute is associated with 2 more units of engagement (an association, not necessarily a causal effect)
  2. Multiple Regression: Predicting revenue with price and marketing spend. Revenue = 1000 + 50×Price - 2×Spend. R² = 0.75.
    • $1 increase in price → $50 more revenue, holding marketing spend constant (so a $50 increase → $2,500 more)
    • $1,000 more marketing spend → $2,000 less revenue, holding price constant
    • Model explains 75% of revenue variance
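
The coefficient readings above can be verified by evaluating the fitted equation directly. In this small sketch the baseline price and spend values are arbitrary assumptions; in a linear model the change per unit does not depend on the starting point.

```python
def predicted_revenue(price, spend):
    """Revenue = 1000 + 50 * Price - 2 * Spend, from the example above."""
    return 1000 + 50 * price - 2 * spend


# Arbitrary baseline chosen for illustration
base = predicted_revenue(price=20, spend=100)

# Change one input at a time while holding the other fixed
price_effect = predicted_revenue(price=21, spend=100) - base        # +1 price
spend_effect = predicted_revenue(price=20, spend=101) - base        # +1 spend
big_price_effect = predicted_revenue(price=70, spend=100) - base    # +50 price
big_spend_effect = predicted_revenue(price=20, spend=1100) - base   # +1000 spend

print(price_effect, spend_effect, big_price_effect, big_spend_effect)
```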

5. Experimental Design

Explanation: Good experimental design ensures valid conclusions. Key principles include randomization, control groups, and adequate sample sizes.

Key Principles:

  • Randomization: Randomly assign users to treatment/control to eliminate bias
  • Control Group: Baseline for comparison
  • Blinding: Participants don't know their group assignment
  • Sample Size: Large enough to detect meaningful effects
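
One common way to implement randomization in online experiments is to hash a stable user identifier, so the same user always lands in the same group. The sketch below is an assumption-laden illustration, not a description of any particular company's system; `assign_variant`, the experiment name, and the user ID format are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    # Including the experiment name keeps assignments independent across tests
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters -> integer in [0, 2^32) -> fraction in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user gets the same answer on every call
print(assign_variant("user_42", "new_checkout"))
print(assign_variant("user_42", "new_checkout"))

# Across many users the split should be close to the requested share
treated = sum(
    assign_variant(f"user_{i}", "new_checkout") == "treatment" for i in range(10_000)
)
print(treated / 10_000)
```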

Worked Examples:

  1. A/B Test Design: Testing a new checkout flow. How would you design this experiment?
    • Metric: Conversion rate (primary), revenue per user (secondary)
    • Randomization: Randomly assign at the user level (for example, by hashing the user ID) so each user sees the same variant on every visit
    • Sample Size: Calculate based on expected effect size
    • Duration: 1-2 weeks to capture weekly patterns
    • Analysis: Compare means, check for significance and practical importance
  2. Common Pitfalls: What could go wrong with this experiment?
    • Novelty Effect: Users react differently to new features initially
    • Seasonal Effects: Holiday traffic patterns affect results
    • Multiple Testing: Running many tests increases false positive risk
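    • Sample Ratio Mismatch: The observed split between groups differs from the intended allocation (e.g., 50/50), which usually signals a bug in assignment or logging and makes the results untrustworthy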
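
Two of these pitfalls can be quantified directly. The sketch below checks for a sample ratio mismatch with a chi-square goodness-of-fit test (1 degree of freedom) and computes the chance of at least one false positive across several independent tests. The function names and the example counts are illustrative assumptions; a strict threshold such as 0.001 is often used for the mismatch check, but that choice is also an assumption here.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed group sizes."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, P(chi-square > c) = erfc(sqrt(c / 2))
    return math.erfc(math.sqrt(chi2 / 2))


def family_wise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive) across independent tests."""
    return 1 - (1 - alpha) ** num_tests


print(srm_p_value(10_000, 10_050))   # small imbalance, consistent with 50/50
print(srm_p_value(10_000, 10_500))   # large imbalance, likely a bug
print(round(family_wise_error_rate(20), 3))
```

With 20 independent tests at α = 0.05, the chance of at least one false positive is about 64%, which is why corrections such as Bonferroni are applied when many metrics or variants are tested.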

6. Bayes' Theorem

Explanation: Bayes' theorem updates probabilities based on new evidence. It's fundamental for understanding conditional probability and is used in spam filtering, medical testing, and A/B test analysis.

Formula:

P(A|B) = [P(B|A) × P(A)] / P(B)

Worked Examples:

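  1. Medical Testing: A disease affects 1% of the population. The test is 99% accurate, meaning it correctly flags 99% of people with the disease and correctly clears 99% of people without it. If you test positive, what's the probability you have the disease?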
    • P(Disease) = 0.01, P(Positive|Disease) = 0.99, P(Positive|No Disease) = 0.01
    • P(Disease|Positive) = (0.99 × 0.01) / [(0.99 × 0.01) + (0.01 × 0.99)] = 0.5 or 50%
    • Even with a positive test, only 50% chance of having the disease!
  2. A/B Test with Prior Knowledge: You have historical data showing 10% of feature changes are successful. Your current test shows p = 0.04. How does this update your belief?
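    • Prior P(Success) = 0.10
    • Likelihood: P(significant result | Success) is the test's power, and P(significant result | No Success) is the false positive rate α
    • Posterior: P(Success | significant) = power × 0.10 / (power × 0.10 + α × 0.90). Assuming 80% power and α = 0.05, this is 0.08 / (0.08 + 0.045) = 0.64, so a significant result raises the belief from 10% to about 64%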
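
Both examples follow the same pattern, which can be written as one small function. In this sketch the name `posterior` is illustrative, and the 80% power and α = 0.05 used for the A/B example are assumptions, since the original question does not state them.

```python
def posterior(prior, p_evidence_given_true, p_evidence_given_false):
    """Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)."""
    numerator = p_evidence_given_true * prior
    # P(E) expanded over the two cases: H true and H false
    evidence = numerator + p_evidence_given_false * (1 - prior)
    return numerator / evidence


# Medical test: 1% prevalence, 99% sensitivity, 1% false positive rate
disease_given_positive = posterior(0.01, 0.99, 0.01)

# A/B test: 10% prior success rate, assumed 80% power, assumed alpha = 0.05
success_given_significant = posterior(0.10, 0.80, 0.05)

print(round(disease_given_positive, 2), round(success_given_significant, 2))
```

The medical example gives 0.5 and the A/B example gives 0.64 under the stated assumptions. A lower assumed power or a lower prior would reduce the second number.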

Test Your Knowledge

🧠 Statistics & Probability Quiz

Test your understanding of statistical concepts, distributions, and hypothesis testing.

  1. What is the Central Limit Theorem?
  2. In hypothesis testing, what is a Type I error?
  3. What p-value threshold is commonly used for statistical significance?
  4. What does a 95% confidence interval mean?
  5. Which distribution is most appropriate for modeling the number of events in a fixed interval?
