Master Product Data Analytics

Your Guide To Data Analytics Mastery

II. Foundational Knowledge & Skills (The Building Blocks)

1. Statistics & Probability

Alright, let's dive into the world of statistics and probability! 📊 Don't worry, we're not going to get bogged down in complex formulas without understanding their meaning. The goal here is to truly master these concepts so that you can analyze data, interpret results, and make sound judgments. Remember, at Meta, data is king 👑, and your ability to wield these statistical tools will be critical to your success. This section is organized around the areas that analytical interviews and on-the-job work for Meta Data Scientists focus on, and it includes practical examples, with full solutions, to test your knowledge. We'll also cover the mathematical foundations so you build a deep, solid understanding. Let's get started!

1.1 Descriptive Statistics (Understanding the Data)

First up, descriptive statistics. This is where we roll up our sleeves and get to know our data. We're talking about summarizing and describing the main features of a dataset. 📊 Think of it as getting a "lay of the land" before you start building anything fancy. Here's what we'll cover (a short Python sketch at the end of this list pulls all of these statistics together):


  • 1.1.1 Measures of Central Tendency (Mean, Median, Mode)

    These are your go-to stats for understanding the "typical" value in your data. We'll talk about when to use each one, and why the mean isn't always the best choice (especially with skewed data! 😉).

    • Mean (Arithmetic Mean):

      The mean, often denoted as \(\mu\) for a population and \(\bar{x}\) for a sample, is the sum of all values divided by the number of values.

      Formula:

      \[ \mu = \frac{\sum_{i=1}^{n} x_i}{n} \]

      where \(x_i\) represents each value in the dataset and \(n\) is the total number of values.

      When to use: The mean is best used when data is normally distributed or when the distribution is not heavily skewed. It is sensitive to outliers.

      Wikipedia: Mean, Wolfram MathWorld: Mean

    • Median:

      The median is the middle value in an ordered dataset. It divides the data into two equal halves.

      How to calculate:

      1. Arrange the data in ascending order.
      2. If \(n\) is odd, the median is the value at the \(\frac{n+1}{2}\) position.
      3. If \(n\) is even, the median is the average of the values at the \(\frac{n}{2}\) and \(\frac{n}{2} + 1\) positions.

      When to use: The median is robust to outliers and skewed distributions, making it a better measure of central tendency in such cases.

      Wikipedia: Median, Wolfram MathWorld: Median

    • Mode:

      The mode is the value that appears most frequently in a dataset.

      When to use: The mode is particularly useful for categorical data or when identifying the most common value in a dataset.

      Wikipedia: Mode, Wolfram MathWorld: Mode

  • 1.1.2 Measures of Dispersion (Variance, Standard Deviation, Range, IQR)

    How spread out is your data? Are all the data points clustered together, or are they all over the place? These measures help us quantify that spread.

    • Variance:

      Variance measures the average squared deviation of each data point from the mean. It is denoted as \(\sigma^2\) for a population and \(s^2\) for a sample.

      Formula:

      \[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \]

      where \(x_i\) is each value, \(\mu\) is the population mean, and \(n\) is the number of values. The sample variance \(s^2\) uses the sample mean \(\bar{x}\) and divides by \(n - 1\) instead of \(n\) (Bessel's correction) so that it doesn't systematically underestimate the population variance.

      Wikipedia: Variance, Wolfram MathWorld: Variance

    • Standard Deviation:

      The standard deviation (\(\sigma\) for a population, \(s\) for a sample) is the square root of the variance. It measures the average amount of variation or dispersion from the mean in the original units of the data.

      Formula:

      \[ \sigma = \sqrt{\sigma^2} \]

      Use: Along with the mean, the standard deviation helps to understand the spread of data in a normal distribution.

      Wikipedia: Standard Deviation, Wolfram MathWorld: Standard Deviation

    • Range:

      The range is the difference between the maximum and minimum values in a dataset.

      Formula:

      \[ \text{Range} = \text{max}(x_i) - \text{min}(x_i) \]

      Use: The range provides a quick, rough estimate of the spread but is sensitive to outliers.

      Wikipedia: Range, Wolfram MathWorld: Range

    • Interquartile Range (IQR):

      The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It represents the spread of the middle 50% of the data.

      Formula:

      \[ \text{IQR} = Q_3 - Q_1 \]

      where \(Q_3\) is the third quartile and \(Q_1\) is the first quartile.

      Use: The IQR is robust to outliers and is particularly useful for skewed distributions.

      Wikipedia: Interquartile Range, Wolfram MathWorld: Interquartile Range

  • 1.1.3 Data Distributions and Visualization (Histograms, Box Plots)

    Sometimes, a picture is worth a thousand numbers. We'll look at how to visualize data distributions using histograms and box plots, so you can quickly grasp the shape and characteristics of your data.

    • Histograms:

      Histograms display the distribution of a dataset by dividing the data into bins and showing the frequency or count of data points in each bin.

      Use: They help visualize the shape of the distribution (e.g., normal, skewed, bimodal) and identify the range of values where most data points fall.

      Wikipedia: Histogram, Wolfram MathWorld: Histogram

    • Box Plots:

      Box plots provide a visual summary of the distribution, showing the median, quartiles, and potential outliers.

      Components:

      • The box represents the interquartile range (IQR), with the median marked inside.
      • Whiskers extend to the farthest data points within 1.5 times the IQR from the box edges.
      • Points beyond the whiskers are considered potential outliers.

      Use: Box plots are useful for comparing distributions across different groups and identifying the presence of outliers.

      Wikipedia: Box Plot, Wolfram MathWorld: Box Plot

  • 1.1.4 Skewness and Kurtosis:

    These are fancy words for describing the asymmetry and "tailedness" of a distribution. We'll break them down and see why they matter.

    • Skewness:

      Skewness measures the asymmetry of a distribution. A distribution is skewed if one tail is longer than the other.

      • Positive Skew (Right Skew): The right tail is longer; the mass of the distribution is concentrated on the left. The mean is typically greater than the median.
      • Negative Skew (Left Skew): The left tail is longer; the mass of the distribution is concentrated on the right. The mean is typically less than the median.

      Formula:

      \[ \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3 / n}{s^3} \]

      where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the number of values.

      Wikipedia: Skewness, Wolfram MathWorld: Skewness

    • Kurtosis:

      Kurtosis measures the "tailedness" of a distribution, or how much data is in the tails compared to a normal distribution.

      • High Kurtosis: Heavy tails, indicating more outliers or extreme values.
      • Low Kurtosis: Light tails, indicating fewer outliers.

      Formula:

      \[ \text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4 / n}{s^4} - 3 \]

      (Note: Subtracting 3 gives the excess kurtosis, which makes the kurtosis of a normal distribution equal to 0.)

      Wikipedia: Kurtosis, Wolfram MathWorld: Kurtosis
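
To tie these definitions together, here is a minimal Python sketch (assuming numpy and scipy are installed; the "app session minutes" sample is made up purely for illustration) that computes every summary statistic covered above on one small dataset:

```python
import numpy as np
from collections import Counter
from scipy import stats

# A small, made-up sample: minutes spent in an app by 12 users (note the outlier, 35).
minutes = np.array([3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 12, 35])

# Central tendency
mean = minutes.mean()
median = np.median(minutes)
mode = Counter(minutes.tolist()).most_common(1)[0][0]  # most frequent value

# Dispersion
variance = minutes.var(ddof=1)          # sample variance (divides by n - 1)
std_dev = minutes.std(ddof=1)           # sample standard deviation
value_range = minutes.max() - minutes.min()
q1, q3 = np.percentile(minutes, [25, 75])
iqr = q3 - q1                           # spread of the middle 50% of the data

# Shape
skewness = stats.skew(minutes)          # > 0 here: the outlier stretches the right tail
excess_kurtosis = stats.kurtosis(minutes)  # Fisher definition: normal distribution = 0

print(f"mean={mean:.2f}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std={std_dev:.2f}, range={value_range}, IQR={iqr}")
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

Notice how the single large value (35) drags the mean (≈8.75) well above the median (6): exactly the right-skewed situation where the median and IQR are the safer summaries.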


2. Probability (Quantifying Uncertainty)

Probability is the bedrock of statistical inference. It's the language we use to talk about uncertainty, and it's essential for making informed decisions in the face of incomplete information. Don't worry, we'll keep it practical and relevant to the kinds of problems you'll encounter at Meta. 👍


  • 2.1 Basic Probability Concepts
    • Sample Spaces, Events, Outcomes:

      We'll start with the fundamentals. An outcome is a single possible result of an experiment. The sample space is the set of all possible outcomes. An event is a subset of the sample space, or a collection of one or more outcomes.

      Example:

      • Experiment: Rolling a six-sided die.
      • Sample space: {1, 2, 3, 4, 5, 6}
      • Event: Rolling an even number (outcomes: 2, 4, 6)

      Wikipedia: Sample Space, Wikipedia: Event, Wolfram MathWorld: Outcome

    • Probability Axioms:

      These are the basic rules that govern probability. They might seem obvious, but they're important to keep in mind.

      1. The probability of any event is a non-negative number between 0 and 1, inclusive.
      2. The probability of the entire sample space is 1.
      3. If two events are mutually exclusive (they cannot both occur at the same time), the probability of either event occurring is the sum of their individual probabilities.

      Wikipedia: Probability Axioms, Wolfram MathWorld: Probability Axioms

    • Calculating Probabilities (Classical, Frequentist, Subjective):

      We'll look at different ways to calculate probabilities, depending on the situation.

      • Classical: Based on counting equally likely outcomes (e.g., rolling a die).
      • Frequentist: Based on the long-run frequency of an event occurring (e.g., observing many coin flips).
      • Subjective: Based on personal beliefs or judgments (e.g., assigning a probability to a new product launch being successful).

      Wikipedia: Probability Interpretations

  • 2.2 Conditional Probability and Independence
    • Defining Conditional Probability:

      This is the probability of an event happening *given* that another event has already occurred. It's a crucial concept for understanding how events relate to each other. We denote the probability of event A happening given that event B has happened as \(P(A|B)\).

      Formula:

      \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

      where \(P(A \cap B)\) is the probability of both A and B happening, and \(P(B)\) is the probability of B happening.

      Wikipedia: Conditional Probability, Wolfram MathWorld: Conditional Probability

    • The Multiplication Rule:

      This rule helps us calculate the probability of two events happening together. It is derived from the definition of conditional probability.

      Formula:

      \[ P(A \cap B) = P(A|B) * P(B) \]

      or

      \[ P(A \cap B) = P(B|A) * P(A) \]

      Wikipedia: Multiplication Rule

    • Independent vs. Dependent Events:

      Two events are independent if the occurrence of one does not affect the probability of the other. They are dependent if the occurrence of one does affect the probability of the other.

      For independent events:

      \[ P(A|B) = P(A) \]

      \[ P(B|A) = P(B) \]

      \[ P(A \cap B) = P(A) * P(B) \]

      Wikipedia: Independence, Wolfram MathWorld: Independent Events

    • Real-world examples: If a user clicks on an ad (Event A), what's the probability they'll make a purchase (Event B)? This is a classic example of conditional probability in action (the sketch at the end of this section works through a version of it with made-up numbers).
  • 2.3 Bayes' Theorem (Updating Beliefs with Data)
    • Prior and Posterior Probabilities:

      Bayes' Theorem provides a way to update our beliefs in light of new evidence. The prior probability is our initial belief about an event before observing any data. The posterior probability is our updated belief after observing the data.

    • Likelihood:

      This is the probability of observing the data given a particular hypothesis.

    • Bayes' Theorem Formula:

      \[ P(A|B) = \frac{P(B|A) * P(A)}{P(B)} \]

      Where:

      • \(P(A|B)\) is the posterior probability of A given B.
      • \(P(B|A)\) is the likelihood of B given A.
      • \(P(A)\) is the prior probability of A.
      • \(P(B)\) is the probability of B.

      In words: The posterior probability of A given B is proportional to the likelihood of B given A multiplied by the prior probability of A.

      Wikipedia: Bayes' Theorem, Wolfram MathWorld: Bayes' Theorem

    • Applications in Spam Filtering, Medical Diagnosis, and A/B Testing:

      Bayes' Theorem has a wide range of applications. For example, in spam filtering, we can use it to update our belief that an email is spam given the words it contains. In A/B testing, we can use it to update our belief that a new feature is better, based on the observed data.

    • Worked-out examples using Bayes' Theorem: We'll walk through some examples in later sections to solidify your understanding.
  • 2.4 Random Variables
    • Discrete vs. Continuous Random Variables:

      A random variable is a variable whose value is a numerical outcome of a random phenomenon. A discrete random variable has a countable number of possible values (e.g., number of clicks, number of likes). A continuous random variable can take on any value within a given range (e.g., time spent on a page, height, weight).

      Wikipedia: Random Variable, Wolfram MathWorld: Random Variable

    • Probability Mass Functions (PMFs):

      These describe the probability distribution of a discrete random variable. The PMF gives the probability that the random variable takes on a specific value.

      Example: For a fair six-sided die, the PMF is \(P(X=k) = 1/6\) for \(k = 1, 2, 3, 4, 5, 6\).

      Wikipedia: Probability Mass Function, Wolfram MathWorld: Probability Mass Function

    • Probability Density Functions (PDFs):

      These describe the probability distribution of a continuous random variable. The probability that the random variable falls within a particular range is given by the area under the PDF curve over that range.

      Example: The standard normal distribution has a bell-shaped PDF.

      Wikipedia: Probability Density Function, Wolfram MathWorld: Probability Density Function

    • Cumulative Distribution Functions (CDFs):

      The CDF gives the probability that a random variable (discrete or continuous) is less than or equal to a certain value.

      Formula (for continuous random variables):

      \[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) dt \]

      where \(f(t)\) is the PDF.

      Wikipedia: Cumulative Distribution Function, Wolfram MathWorld: Distribution Function

    • Expectation and Variance of Random Variables:

      The expectation (or expected value) of a random variable is its average value, weighted by the probabilities of each outcome. The variance measures the spread or dispersion of the random variable around its expected value.

      Formula (for discrete random variables):

      \[ E(X) = \sum_{i} x_i * P(X = x_i) \]

      \[ Var(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2 \]

      Wikipedia: Expected Value, Wikipedia: Variance, Wolfram MathWorld: Expected Value, Wolfram MathWorld: Variance
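
Here is a short, self-contained sketch tying several of these ideas together. It applies Bayes' Theorem to a hypothetical ad-click / purchase funnel (all probabilities below are invented for illustration) and checks the expectation and variance of a fair die both analytically and by simulation; numpy is assumed to be installed:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Conditional probability and Bayes' Theorem (hypothetical numbers) ---
p_click = 0.10               # P(A): user clicks the ad
p_buy_given_click = 0.05     # P(B|A): purchase given a click
p_buy_given_no_click = 0.01  # P(B|not A): purchase without a click

# Law of total probability: P(B)
p_buy = p_buy_given_click * p_click + p_buy_given_no_click * (1 - p_click)

# Bayes' Theorem: probability the user had clicked, given that they purchased
p_click_given_buy = p_buy_given_click * p_click / p_buy
print(f"P(purchase) = {p_buy:.4f}")
print(f"P(clicked | purchased) = {p_click_given_buy:.3f}")

# --- Expectation and variance of a discrete random variable (a fair die) ---
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

e_x = np.sum(values * pmf)                  # E(X) = 3.5
var_x = np.sum(values**2 * pmf) - e_x**2    # Var(X) = E(X^2) - [E(X)]^2 ≈ 2.9167

# Simulation: sample averages settle near the analytical values
rolls = rng.integers(1, 7, size=100_000)
print(f"E(X) = {e_x:.4f}, Var(X) = {var_x:.4f}")
print(f"simulated mean = {rolls.mean():.4f}, simulated variance = {rolls.var():.4f}")
```

The simulated mean and variance land close to the analytical values, which is a preview of the Law of Large Numbers covered in the next section.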


3. Probability Distributions

Probability distributions are mathematical functions that describe the likelihood of different outcomes in a random experiment. They're essential for modeling real-world phenomena and making predictions. Here are some of the most important ones (a short scipy sketch at the end of this section shows how to work with them in code):


  • 3.1 Discrete Distributions
    • Bernoulli Distribution

      Models a single trial with two possible outcomes (success or failure), each with a fixed probability.

      Parameters:

      • \(p\): Probability of success (0 ≤ \(p\) ≤ 1)

      Probability Mass Function (PMF):

      \[ P(X=k) = \begin{cases} p & \text{if } k=1 \text{ (success)} \\ 1-p & \text{if } k=0 \text{ (failure)} \end{cases} \]

      Example: A single coin flip (Heads = success, Tails = failure).

      Expectation: \(E(X) = p\)

      Variance: \(Var(X) = p(1-p)\)

      Wikipedia: Bernoulli Distribution, Wolfram MathWorld: Bernoulli Distribution

    • Binomial Distribution

      Models the number of successes in a fixed number of independent Bernoulli trials.

      Parameters:

      • \(n\): Number of trials
      • \(p\): Probability of success on each trial

      Probability Mass Function (PMF):

      \[ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \]

      where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the binomial coefficient.

      Example: The number of heads in 10 coin flips.

      Expectation: \(E(X) = np\)

      Variance: \(Var(X) = np(1-p)\)

      Wikipedia: Binomial Distribution, Wolfram MathWorld: Binomial Distribution

    • Poisson Distribution

      Models the number of events occurring in a fixed interval of time or space, given a constant average rate of occurrence.

      Parameter:

      • \(\lambda\): The average rate of events per interval (\(\lambda\) > 0)

      Probability Mass Function (PMF):

      \[ P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!} \]

      Example: The number of users visiting a website per hour.

      Expectation: \(E(X) = \lambda\)

      Variance: \(Var(X) = \lambda\)

      Wikipedia: Poisson Distribution, Wolfram MathWorld: Poisson Distribution

    • Geometric Distribution

      Models the number of independent Bernoulli trials needed to get the first success.

      Parameter:

      • \(p\): Probability of success on each trial (0 < \(p\) ≤ 1)

      Probability Mass Function (PMF):

      \[ P(X=k) = (1-p)^{k-1} p \]

      Example: The number of ad impressions until a user clicks on an ad.

      Expectation: \(E(X) = \frac{1}{p}\)

      Variance: \(Var(X) = \frac{1-p}{p^2}\)

      Wikipedia: Geometric Distribution, Wolfram MathWorld: Geometric Distribution

  • 3.2 Continuous Distributions
    • Normal (Gaussian) Distribution

      A bell-shaped distribution that is symmetric around the mean. Many natural phenomena and measurement errors follow a normal distribution. It is also the limiting distribution of the sum of many independent random variables (Central Limit Theorem).

      Parameters:

      • \(\mu\): Mean (center of the distribution)
      • \(\sigma\): Standard deviation (spread of the distribution)

      Probability Density Function (PDF):

      \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

      Standard Normal Distribution: A special case where \(\mu = 0\) and \(\sigma = 1\). Any normal distribution can be transformed into a standard normal distribution by standardizing: \(Z = \frac{X - \mu}{\sigma}\)

      Example: Heights of people, measurement errors.

      Expectation: \(E(X) = \mu\)

      Variance: \(Var(X) = \sigma^2\)

      Wikipedia: Normal Distribution, Wolfram MathWorld: Normal Distribution

    • Exponential Distribution

      Models the time until an event occurs, assuming a constant rate of occurrence. Often used in reliability analysis and queuing theory.

      Parameter:

      • \(\lambda\): Rate parameter (\(\lambda\) > 0), which is the average number of events per unit of time.

      Probability Density Function (PDF):

      \[ f(x) = \lambda e^{-\lambda x} \text{ for } x \ge 0 \]

      Example: Time until a user churns, time until the next customer arrives.

      Expectation: \(E(X) = \frac{1}{\lambda}\)

      Variance: \(Var(X) = \frac{1}{\lambda^2}\)

      Wikipedia: Exponential Distribution, Wolfram MathWorld: Exponential Distribution

    • Uniform Distribution

      All values within a given range are equally likely.

      Parameters:

      • \(a\): Lower bound of the range
      • \(b\): Upper bound of the range

      Probability Density Function (PDF):

      \[ f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]

      Example: Random number generation between 0 and 1.

      Expectation: \(E(X) = \frac{a+b}{2}\)

      Variance: \(Var(X) = \frac{(b-a)^2}{12}\)

      Wikipedia: Continuous Uniform Distribution, Wolfram MathWorld: Uniform Distribution

  • 3.3 Key Theorems
    • Law of Large Numbers:

      This theorem states that as you repeat an experiment many times, the average of the results will converge to the expected value. In simpler terms, the more data you collect, the closer your sample average will be to the true population average.

      Practical Implication: When making decisions based on data, it's important to have a sufficiently large sample size to ensure that your results are reliable.

      Wikipedia: Law of Large Numbers, Wolfram MathWorld: Law of Large Numbers

    • Central Limit Theorem:

      One of the most important theorems in statistics! It states that the sum (or average) of a large number of independent and identically distributed random variables (with finite variance) will be approximately normally distributed, regardless of the underlying distribution of the individual variables.

      Practical Implication: This theorem allows us to use statistical methods that assume normality (like t-tests and confidence intervals) even when the original data is not normally distributed, as long as the sample size is large enough. This is why the normal distribution is so important in statistical inference.

      Wikipedia: Central Limit Theorem, Wolfram MathWorld: Central Limit Theorem

    • Understanding the relationship between distributions:

      Many distributions are related to each other. For example:

      • The binomial distribution can be approximated by the normal distribution when the number of trials is large and the probability of success is not too close to 0 or 1.
      • The sum of independent Poisson random variables also follows a Poisson distribution.
      • The exponential distribution is the continuous counterpart of the geometric distribution.

      Understanding these relationships can help you choose the appropriate distribution for a given problem and make connections between different statistical concepts.
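
A quick way to internalize these distributions is to evaluate a few PMF/PDF/CDF values and confirm the expectation and variance formulas numerically. The sketch below (assuming scipy is installed; every parameter value is an arbitrary example) does that for the binomial, Poisson, normal, and exponential distributions, and checks the normal approximation to the binomial mentioned above:

```python
import numpy as np
from scipy import stats

# Binomial: number of heads in n = 10 fair coin flips
n, p = 10, 0.5
binom = stats.binom(n, p)
print(f"Binomial: P(X=5) = {binom.pmf(5):.4f}, E(X) = {binom.mean()}, Var(X) = {binom.var()}")

# Poisson: site visits per hour with average rate lambda = 4
pois = stats.poisson(4)
print(f"Poisson:  P(X=2) = {pois.pmf(2):.4f}, E(X) = {pois.mean()}, Var(X) = {pois.var()}")

# Standard normal: probability of falling within 1.96 standard deviations of the mean
norm = stats.norm(0, 1)
print(f"Normal:   P(-1.96 < X < 1.96) = {norm.cdf(1.96) - norm.cdf(-1.96):.4f}")

# Exponential: time to event with rate lambda = 0.5 (scipy uses scale = 1 / lambda)
rate = 0.5
expo = stats.expon(scale=1 / rate)
print(f"Exponential: E(X) = {expo.mean()}, Var(X) = {expo.var()}")

# Normal approximation to the binomial: large n, p not too close to 0 or 1
n_big, p_big = 1000, 0.3
exact = stats.binom(n_big, p_big).cdf(320)
approx = stats.norm(n_big * p_big, np.sqrt(n_big * p_big * (1 - p_big))).cdf(320.5)  # continuity correction
print(f"Binomial CDF at 320: exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

One detail worth remembering: scipy parameterizes the exponential distribution by scale = 1/λ rather than by the rate λ itself.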


4. Hypothesis Testing

Hypothesis testing is a formal procedure for using data to evaluate the validity of a claim about a population. It's a cornerstone of statistical inference and a critical tool for data scientists at Meta. Let's break down the key concepts:


  • 4.1 Null and Alternative Hypotheses

    Every hypothesis test has two competing hypotheses:

    • Null Hypothesis (\(H_0\)): This is the statement of "no effect" or "no difference." It's the status quo assumption that we're trying to disprove.
    • Alternative Hypothesis (\(H_1\) or \(H_a\)): This is the statement that we're trying to find evidence for. It's the opposite of the null hypothesis.

    Example:

    • \(H_0\): The new ad campaign has no effect on click-through rates.
    • \(H_1\): The new ad campaign increases click-through rates.

    Wikipedia: Statistical Hypothesis Testing, Wolfram MathWorld: Hypothesis Testing

  • 4.2 Type I and Type II Errors

    In hypothesis testing, there are two types of errors we can make:

    • Type I Error (False Positive): Rejecting the null hypothesis when it is actually true. The probability of making a Type I error is denoted by \(\alpha\) (alpha) and is called the significance level.
    • Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false. The probability of making a Type II error is denoted by \(\beta\) (beta).

    Analogy: Think of a smoke detector. A Type I error is a false alarm (the alarm goes off when there's no fire), and a Type II error is a missed alarm (the alarm doesn't go off when there is a fire).

    Trade-off: There's a trade-off between Type I and Type II errors. For a fixed sample size, decreasing the probability of one type of error generally increases the probability of the other.

    Wikipedia: Type I and Type II Errors, Wolfram MathWorld: Type I Error, Wolfram MathWorld: Type II Error

  • 4.3 p-values and Statistical Significance

    The p-value is the probability of observing data as extreme as, or more extreme than, the actual data collected, assuming that the null hypothesis is true. It's a measure of how much evidence we have against the null hypothesis.

    Interpretation:

    • A small p-value (typically less than \(\alpha\), which is often set to 0.05) suggests that the observed data is unlikely to have occurred by chance alone if the null hypothesis were true. This provides evidence against the null hypothesis.
    • A large p-value suggests that the observed data is consistent with the null hypothesis.

    Decision Rule:

    • If p-value ≤ \(\alpha\): Reject the null hypothesis.
    • If p-value > \(\alpha\): Fail to reject the null hypothesis.

    Important Note: The p-value is NOT the probability that the null hypothesis is true. It's the probability of observing the data (or more extreme data) given that the null hypothesis is true.

    Wikipedia: P-value, Wolfram MathWorld: P-value

  • 4.4 Confidence Intervals

    A confidence interval provides a range of plausible values for a population parameter (e.g., the population mean, the difference in means between two groups). It's constructed in such a way that we have a certain level of confidence (e.g., 95%) that the true parameter lies within the interval.

    Interpretation: A 95% confidence interval means that if we were to repeat the experiment many times and construct a confidence interval each time, about 95% of those intervals would contain the true population parameter.

    Relationship to Hypothesis Testing: If the null hypothesis value of the parameter falls outside the confidence interval, we can reject the null hypothesis at the corresponding significance level.

    Wikipedia: Confidence Interval, Wolfram MathWorld: Confidence Interval

  • 4.5 Statistical Power and Sample Size Determination

    Statistical power is the probability of correctly rejecting the null hypothesis when it is false (i.e., the probability of avoiding a Type II error). It depends on the sample size, the effect size, the significance level, and the variability of the data.

    Sample Size Determination: Before conducting an experiment, it's important to determine the required sample size to achieve a desired level of power. This ensures that the study has a good chance of detecting a meaningful effect if it exists.

    Factors Affecting Power:

    • Sample Size: Larger sample size = more power.
    • Effect Size: Larger effect size (i.e., a bigger difference between groups or a stronger relationship between variables) = more power.
    • Significance Level (\(\alpha\)): Higher \(\alpha\) = more power (but also a higher risk of Type I error).
    • Variability of Data: Lower variability = more power.

    Wikipedia: Statistical Power, Wolfram MathWorld: Statistical Power

  • 4.6 Common Hypothesis Tests

    Here are some commonly used hypothesis tests (the scipy sketch after this list runs each of them on simulated data):

    • t-tests (Comparing Means)

      Used to compare the means of two groups. There are different types of t-tests depending on whether the samples are independent or paired and whether the variances are assumed to be equal.

      • Independent Samples t-test: Used when the two groups are independent of each other (e.g., comparing the average time spent on a website between users who see ad A and users who see ad B).
      • Paired Samples t-test: Used when the two groups are dependent or paired (e.g., comparing the average blood pressure of patients before and after taking a medication).
      • One-Sample t-test: Used to compare the mean of a single group to a known or hypothesized value.

      Assumptions:

      • The data should be approximately normally distributed within each group.
      • The variances of the two groups should be approximately equal (for independent samples t-test).

      Test Statistic:

      \[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

      where \(\bar{x}_1\) and \(\bar{x}_2\) are the sample means, \(s_p\) is the pooled standard deviation, and \(n_1\) and \(n_2\) are the sample sizes.

      Wikipedia: Student's t-test, Wolfram MathWorld: Student's t-Test

    • Chi-squared Tests (Analyzing Categorical Data)

      Used to determine if there's a relationship between two categorical variables.

      • Chi-squared Test of Independence: Tests whether two categorical variables are independent or associated.
      • Chi-squared Goodness-of-Fit Test: Tests whether a sample distribution matches a hypothesized distribution.

      Test Statistic:

      \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

      where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency for each category.

      Wikipedia: Chi-squared Test, Wolfram MathWorld: Chi-Squared Test

    • ANOVA (Comparing Means Across Multiple Groups)

      Used to compare the means of more than two groups. It determines whether there are any statistically significant differences between the group means.

      Assumptions:

      • The data should be approximately normally distributed within each group.
      • The variances of the groups should be approximately equal.
      • The observations should be independent.

      Test Statistic:

      \[ F = \frac{\text{Variance between groups}}{\text{Variance within groups}} \]

      Wikipedia: Analysis of Variance (ANOVA), Wolfram MathWorld: Analysis of Variance
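
To see these tests in action, here is a scipy-based sketch that runs an independent-samples t-test, a chi-squared test of independence, and a one-way ANOVA. All of the data is simulated, and the group means, rates, and contingency counts are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# --- Independent-samples t-test: time on site (minutes) for two ad variants ---
group_a = rng.normal(loc=5.0, scale=2.0, size=500)  # simulated control
group_b = rng.normal(loc=5.3, scale=2.0, size=500)  # simulated treatment (true lift = 0.3)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_val:.4f}")

# --- Chi-squared test of independence: device type vs. conversion (made-up counts) ---
#                         converted  not converted
contingency = np.array([[120, 880],    # mobile
                        [ 90, 910]])   # desktop
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
print(f"chi-squared: chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi:.4f}")

# --- One-way ANOVA: comparing means across three groups ---
g1 = rng.normal(10.0, 3.0, size=200)
g2 = rng.normal(10.5, 3.0, size=200)
g3 = rng.normal(11.0, 3.0, size=200)
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Decision rule at alpha = 0.05: reject H0 when p <= alpha
alpha = 0.05
print(f"Reject H0 in the t-test at alpha = {alpha}? {p_val <= alpha}")
```

Because the treatment group is simulated with a genuinely higher mean, the t-test will often (though not always, since the data is random) produce a small p-value here; rerunning with loc=5.0 for both groups shows what results look like when the null hypothesis is true.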


5. Regression Analysis

Regression analysis is a set of statistical methods used to model the relationship between a dependent variable and one or more independent variables. It's a powerful tool for prediction and for understanding how variables relate to one another; on its own, regression shows association, and establishing causality requires the experimental-design ideas covered in the next section. A short numpy/scipy sketch at the end of this section fits these models on synthetic data.


  • 5.1 Simple Linear Regression

    This is the simplest form of regression, where we model the relationship between two variables with a straight line.

    Model:

    \[ y = \beta_0 + \beta_1 x + \epsilon \]

    where:

    • \(y\) is the dependent variable (the outcome we're trying to predict).
    • \(x\) is the independent variable (the predictor variable).
    • \(\beta_0\) is the y-intercept (the value of \(y\) when \(x\) is 0).
    • \(\beta_1\) is the slope (the change in \(y\) for a one-unit increase in \(x\)).
    • \(\epsilon\) is the error term (the difference between the actual value of \(y\) and the predicted value).

    Example: Predicting ice cream sales based on temperature.

    Estimation: The most common method for estimating the coefficients (\(\beta_0\) and \(\beta_1\)) is Ordinary Least Squares (OLS), which minimizes the sum of the squared differences between the actual and predicted values of \(y\).

    Wikipedia: Simple Linear Regression, Wolfram MathWorld: Least Squares Fitting

  • 5.2 Multiple Linear Regression

    This is an extension of simple linear regression where we have more than one independent variable.

    Model:

    \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon \]

    where we have \(p\) independent variables.

    Example: Predicting house prices based on square footage, number of bedrooms, and location.

    Estimation: Similar to simple linear regression, we use OLS to estimate the coefficients.

    Wikipedia: Linear Regression, Wolfram MathWorld: Least Squares Fitting

  • 5.3 Model Evaluation (R-squared, RMSE, MAE)

    How do we know if our regression model is any good? Here are some common metrics:

    • R-squared (\(R^2\)): Measures the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, with higher values indicating a better fit.
    • Root Mean Squared Error (RMSE): Measures the average difference between the actual and predicted values of the dependent variable. It has the same units as the dependent variable.
    • Mean Absolute Error (MAE): Measures the average absolute difference between the actual and predicted values. It is less sensitive to outliers than RMSE.

    Wikipedia: Coefficient of Determination, Wikipedia: Root Mean Square Deviation, Wikipedia: Mean Absolute Error

  • 5.4 Logistic Regression (for Binary Outcomes)

    What if our dependent variable is binary (e.g., click/no click, convert/no convert)? That's where logistic regression comes in.

    Model:

    \[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p)}} \]

    where \(p\) is the probability of the outcome (e.g., the probability of a click), and the right-hand side is the logistic function.

    Example: Predicting whether a user will click on an ad based on their age, gender, and interests.

    Estimation: We use Maximum Likelihood Estimation (MLE) to estimate the coefficients.

    Wikipedia: Logistic Regression, Wolfram MathWorld: Logistic Equation

  • 5.5 Interpreting Regression Coefficients

    The coefficients in a regression model tell us the change in the dependent variable associated with a one-unit change in the corresponding independent variable, holding all other independent variables constant.

    Example: In a simple linear regression model predicting house prices based on square footage, a coefficient of 100 for square footage would mean that, on average, each additional square foot is associated with a $100 increase in price.

    Important Note: The interpretation of coefficients can be more complex in multiple linear regression: when independent variables are correlated with one another (multicollinearity), individual coefficient estimates can become unstable and harder to interpret.
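
Here is a compact sketch (numpy and scipy only; the data is synthetic, with "true" coefficients chosen arbitrarily) showing simple linear regression via scipy.stats.linregress, multiple linear regression via ordinary least squares with np.linalg.lstsq, the R-squared/RMSE/MAE metrics computed by hand, and the logistic function at the heart of logistic regression:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# --- Simple linear regression: ice cream sales vs. temperature (synthetic data) ---
temperature = rng.uniform(15, 35, size=200)
sales = 20 + 3.0 * temperature + rng.normal(0, 5, size=200)  # true intercept 20, slope 3
fit = stats.linregress(temperature, sales)
print(f"intercept = {fit.intercept:.2f}, slope = {fit.slope:.2f}, R^2 = {fit.rvalue**2:.3f}")

# --- Multiple linear regression via OLS (design matrix with an intercept column) ---
sqft = rng.uniform(50, 250, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 50_000 + 800 * sqft + 10_000 * bedrooms + rng.normal(0, 20_000, size=200)
X = np.column_stack([np.ones_like(sqft), sqft, bedrooms])  # columns: [1, x1, x2]
beta, *_ = np.linalg.lstsq(X, price, rcond=None)           # OLS estimates of beta_0, beta_1, beta_2
print(f"estimated coefficients: {np.round(beta, 1)}")

# --- Evaluation metrics for the multiple-regression fit ---
predicted = X @ beta
residuals = price - predicted
rmse = np.sqrt(np.mean(residuals**2))
mae = np.mean(np.abs(residuals))
r_squared = 1 - np.sum(residuals**2) / np.sum((price - price.mean())**2)
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:,.0f}, MAE = {mae:,.0f}")

# --- The logistic function used in logistic regression ---
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

print(f"logistic(0) = {logistic(0):.2f}  (log-odds of 0 corresponds to a probability of 0.5)")
```

For day-to-day work you would typically reach for a library such as statsmodels (which adds standard errors, p-values, and diagnostics on top of the coefficient estimates), but the raw linear algebra above is all that OLS really is.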


6. Experimental Design / Causal Inference

Experimental design is the process of planning an experiment to test a hypothesis in a way that allows us to draw causal conclusions. It's crucial for establishing whether a change in one variable really causes a change in another, and not just that they are associated or correlated. In this section we will look into the key ideas behind causal inference, focusing on how to design experiments that allow us to attribute effects to their causes with confidence.


  • 6.1 A/B Testing Fundamentals

    A/B testing is the most widely used method for causal inference in tech. The idea is to randomly assign users to different versions of a product or feature (A and B), and then compare their behavior. By randomizing, we aim to ensure that the groups are similar in all respects except for the treatment, allowing us to attribute any observed differences to the treatment itself.

    Key Components:

    • Control Group (A): The group that receives the existing version or no treatment.
    • Treatment Group (B): The group that receives the new version or treatment.
    • Randomization: Assigning users to groups randomly to ensure that the groups are comparable.
    • Outcome Variable: The metric you're measuring (e.g., click-through rate, conversion rate, time spent).
    • Hypothesis: A statement about the expected effect of the treatment.

    Example: Testing a new button color on a website to see if it increases click-through rates (a simulated version of this exact test appears at the end of this section).

    Wikipedia: A/B Testing

  • 6.2 Randomization and Control Groups

    Randomization is the cornerstone of experimental design. By randomly assigning units (e.g., users, sessions) to treatment and control groups, we aim to create groups that are statistically equivalent in all respects except for the treatment. This helps us isolate the effect of the treatment.

    Control Group: The control group serves as a baseline for comparison. It allows us to estimate what would have happened in the absence of the treatment. For that comparison to be valid, the control group must be comparable to the treatment group in every way except exposure to the treatment, which is exactly what randomization delivers (in expectation).

    Why Randomization is Important:

    • Minimizes bias by ensuring that known and unknown confounding factors are evenly distributed across groups.
    • Allows us to make causal inferences about the effect of the treatment.
  • 6.3 Confounding Variables and Bias

    Confounding variables are variables that are associated with both the treatment and the outcome. If not accounted for, they can bias the results and lead to incorrect conclusions. Randomization helps to mitigate the influence of confounding variables, but it's important to be aware of potential confounders and control for them if possible.

    Example: Suppose you're testing a new educational program, and students who volunteer for the program (treatment group) tend to be more motivated than those who don't (control group). Motivation is a confounding variable because it's associated with both the treatment (volunteering) and the outcome (academic performance).

    Types of Bias:

    • Selection Bias: Occurs when the treatment and control groups are not comparable at baseline.
    • Omitted Variable Bias: Occurs when a confounding variable is not included in the analysis.
    • Measurement Bias: Occurs when the outcome variable is measured differently between the treatment and control groups.
  • 6.4 Beyond A/B Testing: Quasi-Experiments and Observational Studies

    Sometimes, a true randomized experiment (like an A/B test) is not feasible or ethical. In these cases, we can use quasi-experimental or observational methods to try to estimate causal effects.

    • Quasi-Experiments: These are studies where the treatment assignment is not completely random, but there is some "as-if" random variation that can be exploited. Examples include:
      • Regression Discontinuity: Comparing units just above and below a cutoff threshold for treatment assignment.
      • Difference-in-Differences: Comparing changes in outcomes over time between a treatment group and a control group that are assumed to have parallel trends in the absence of treatment.
      • Instrumental Variables: Using a variable that affects the treatment but not the outcome directly to isolate the causal effect of the treatment.
    • Observational Studies: These are studies where the researcher does not control the treatment assignment. Instead, they observe the relationship between the treatment and the outcome, while trying to control for confounding variables. Examples include:
      • Matching: Creating a control group that is similar to the treatment group based on observed characteristics.
      • Regression Analysis: Using statistical models to control for confounding variables.

    Limitations: Causal inference is more challenging in quasi-experiments and observational studies because it's harder to rule out confounding variables. These approaches often require making additional assumptions.

    Wikipedia: Quasi-Experiment, Wikipedia: Observational Study

  • 6.5 Ethical Considerations in Experimentation

    When conducting experiments with human subjects, it's important to consider ethical principles such as:

    • Informed Consent: Participants should be informed about the experiment and agree to participate.
    • Beneficence: The potential benefits of the experiment should outweigh the risks to participants.
    • Justice: The benefits and risks of the experiment should be fairly distributed among the population.
    • Data Privacy and Security: Protecting the privacy and security of participants' data is of utmost importance.

    Example: When testing a new feature on a social media platform, it's important to consider the potential impact on users' mental health, privacy, and well-being.

    Wikipedia: Research Ethics
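
To connect these design ideas back to the hypothesis-testing material in Section 4, here is a simulation of the button-color example. Users are randomly assigned to control (A) or treatment (B), clicks are simulated from invented "true" click-through rates, and a two-proportion z-test is computed by hand; numpy and scipy are assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# --- Simulated A/B test of a new button color (all rates are made up) ---
n_users = 20_000
true_ctr = {"A": 0.100, "B": 0.112}  # control vs. treatment click-through rates

# Randomization: each user is independently assigned to A or B with probability 0.5
assignment = rng.choice(["A", "B"], size=n_users)

clicks, exposed = {}, {}
for variant in ("A", "B"):
    mask = assignment == variant
    exposed[variant] = mask.sum()
    # Outcome: each exposed user clicks with that variant's "true" probability
    clicks[variant] = rng.binomial(1, true_ctr[variant], size=exposed[variant]).sum()

p_a = clicks["A"] / exposed["A"]
p_b = clicks["B"] / exposed["B"]

# Two-proportion z-test: H0 is "no difference in click-through rates"
p_pool = (clicks["A"] + clicks["B"]) / (exposed["A"] + exposed["B"])
se = np.sqrt(p_pool * (1 - p_pool) * (1 / exposed["A"] + 1 / exposed["B"]))
z = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided p-value

print(f"observed CTR: A = {p_a:.4f}, B = {p_b:.4f}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject H0 at alpha = 0.05" if p_value <= 0.05 else "Fail to reject H0 at alpha = 0.05")
```

Because assignment is random, confounders such as device type or user motivation are balanced across the two groups in expectation, which is what licenses reading the observed difference as a causal effect of the button color.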


7. Bayesian Methods

Bayesian methods offer an alternative approach to statistical inference that is based on updating prior beliefs with new data. Unlike frequentist methods, which focus on long-run frequencies, Bayesian methods treat probabilities as degrees of belief.


  • 7.1 Bayes' Theorem

    We've already touched on Bayes' Theorem, but let's revisit it in more detail. It's the fundamental equation that underlies Bayesian inference.

    Formula:

    \[ P(A|B) = \frac{P(B|A) * P(A)}{P(B)} \]

    Components:

    • \(P(A|B)\): Posterior probability - the probability of event A given that event B has occurred.
    • \(P(B|A)\): Likelihood - the probability of observing event B given that event A is true.
    • \(P(A)\): Prior probability - the initial probability of event A.
    • \(P(B)\): Evidence - the probability of event B occurring.

    In words: The posterior probability of A given B is proportional to the likelihood of B given A multiplied by the prior probability of A.

    Wikipedia: Bayes' Theorem, Wolfram MathWorld: Bayes' Theorem

  • 7.2 Prior and Posterior Distributions

    In Bayesian inference, we represent our uncertainty about parameters using probability distributions.

    • Prior Distribution: This represents our beliefs about a parameter *before* observing any data. It can be based on previous studies, expert opinion, or other sources of information. The prior distribution is denoted as \(P(\theta)\), where \(\theta\) is the parameter of interest.
    • Posterior Distribution: This represents our updated beliefs about the parameter *after* observing the data. It combines the information from the prior distribution and the likelihood function. The posterior distribution is denoted as \(P(\theta|data)\).

    Bayes' Theorem for Distributions:

    \[ P(\theta|data) = \frac{P(data|\theta) * P(\theta)}{P(data)} \]

    where:

    • \(P(\theta|data)\) is the posterior distribution.
    • \(P(data|\theta)\) is the likelihood function.
    • \(P(\theta)\) is the prior distribution.
    • \(P(data)\) is the marginal likelihood (a normalizing constant).

    Example: Suppose we want to estimate the click-through rate (\(\theta\)) of an ad. We might start with a prior distribution that reflects our belief that the CTR is around 0.1 (10%). After observing some data (clicks and impressions), we can update our belief using Bayes' Theorem to obtain a posterior distribution for the CTR.

  • 7.3 Applications in A/B Testing and Personalization

    Bayesian methods are increasingly being used in A/B testing and personalization because they offer several advantages over frequentist methods:

    • Incorporating Prior Information: Bayesian methods allow us to incorporate prior knowledge or beliefs into the analysis, which can be especially useful when data is limited.
    • Quantifying Uncertainty: Bayesian methods provide a full posterior distribution for the parameter of interest, which allows us to quantify our uncertainty about the estimate.
    • Making Decisions: Bayesian methods provide a natural framework for making decisions under uncertainty. For example, we can calculate the probability that one treatment is better than another, or we can choose the treatment that maximizes the expected utility.
    • Sequential Analysis: Bayesian methods can be easily adapted for sequential analysis, where we update our beliefs as new data comes in. This allows us to stop an experiment early if there is strong evidence in favor of one treatment.

    Example: In a Bayesian A/B test, we might start with a prior distribution for the difference in conversion rates between two versions of a website. As we collect data, we update the posterior distribution. We can then calculate the probability that the new version is better than the old version, and we can stop the test early if this probability exceeds a certain threshold (the sketch at the end of this section implements a minimal version of this).

    Wikipedia: Bayesian Inference, Wikipedia: Bayesian Statistics
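
Here is a minimal Bayesian A/B test along the lines of the example above, assuming numpy is installed. It places a Beta prior on each variant's conversion rate (the Beta distribution is the conjugate prior for binomial data), updates it with invented conversion counts, and estimates the probability that B beats A by sampling from the two posteriors:

```python
import numpy as np

rng = np.random.default_rng(123)

# Prior: Beta(2, 20) loosely encodes a belief that conversion rates sit around 10%
alpha_prior, beta_prior = 2, 20

# Observed (made-up) data: conversions and non-conversions per variant
data = {
    "A": {"conversions": 210, "failures": 1790},  # 2,000 users, ~10.5% observed
    "B": {"conversions": 245, "failures": 1755},  # 2,000 users, ~12.3% observed
}

# Conjugate update: posterior is Beta(alpha + conversions, beta + failures)
posteriors = {
    variant: (alpha_prior + d["conversions"], beta_prior + d["failures"])
    for variant, d in data.items()
}

# Monte Carlo: sample both posteriors and estimate P(rate_B > rate_A)
samples = {v: rng.beta(a, b, size=200_000) for v, (a, b) in posteriors.items()}
prob_b_better = np.mean(samples["B"] > samples["A"])
expected_lift = np.mean(samples["B"] - samples["A"])

for variant, (a, b) in posteriors.items():
    print(f"variant {variant}: posterior mean conversion rate = {a / (a + b):.4f}")
print(f"P(B > A) = {prob_b_better:.3f}, expected lift = {expected_lift:.4f}")

# A simple, purely illustrative decision rule
if prob_b_better > 0.95:
    print("Strong evidence that B is better: consider stopping the test.")
else:
    print("Evidence is not yet conclusive: keep collecting data.")
```

The 95% threshold in the decision rule is illustrative rather than standard practice; whatever stopping rule you use should be chosen before the test starts.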