Appendix

📖 Glossary of Terms

Quick reference definitions for key concepts used throughout the handbook.

A

A/B Testing: A randomized controlled experiment comparing two versions (A and B) to determine which performs better on a defined metric. Also called split testing.
Accuracy: The proportion of correct predictions (true positives + true negatives) among the total number of cases examined. Formula: (TP + TN) / (TP + TN + FP + FN)
Alpha (α): The significance level in hypothesis testing; the probability of rejecting the null hypothesis when it's actually true (Type I error rate). Commonly set at 0.05.
ANOVA (Analysis of Variance): A statistical method for comparing means of three or more groups to determine if at least one group mean differs significantly from others.
ARPU (Average Revenue Per User): A metric measuring the average revenue generated per user, commonly used in subscription and freemium business models.

B

Bayes' Theorem: A formula for calculating conditional probabilities: P(A|B) = P(B|A) × P(A) / P(B). Used for updating beliefs based on new evidence.
Beta (β): In hypothesis testing, the probability of failing to reject the null hypothesis when it's actually false (Type II error rate). Statistical power = 1 - β.
Binomial Distribution: A probability distribution for the number of successes in n independent trials, each with probability p of success.
Bias: Systematic error that causes results to deviate from the true value in a consistent direction. In ML, it refers to the error introduced by overly simple model assumptions, which leads to underfitting.
Bonferroni Correction: A method to adjust significance levels when performing multiple comparisons, dividing α by the number of tests to control family-wise error rate.
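
As a worked illustration of Bayes' Theorem from this section, the sketch below updates the probability of a condition after a positive test result. The function name `posterior` and the input numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are hypothetical and chosen only for illustration.

```python
def posterior(prior: float, p_pos_given_true: float, p_pos_given_false: float) -> float:
    """Return P(A|B) via Bayes' Theorem, where B is a positive test result."""
    # P(B) by the law of total probability
    p_pos = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_pos_given_true * prior / p_pos


# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false positive rate
p = posterior(prior=0.01, p_pos_given_true=0.95, p_pos_given_false=0.05)
print(f"P(condition | positive test) = {p:.3f}")
```

With these assumed inputs the posterior stays well below 50% even after a positive result, because the prior is so small.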

C

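Central Limit Theorem (CLT): States that the sampling distribution of the mean of independent, identically distributed observations approaches a normal distribution as sample size increases, regardless of the shape of the population's distribution, provided the population has finite variance.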
Chi-Square Test: A statistical test for categorical data to assess whether observed frequencies differ from expected frequencies.
Cohort Analysis: A type of analysis that groups users by shared characteristics (often acquisition date) to track behavior over time.
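Confidence Interval: A range of values, computed from sample data, constructed so that a specified proportion of such intervals (e.g., 95%) would contain the true population parameter over repeated sampling. It is not the probability that one particular interval contains the parameter.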
Confounding Variable: A variable that influences both the dependent and independent variables, potentially creating a spurious association.
Conversion Rate: The percentage of users who complete a desired action (e.g., sign up, purchase) out of total users exposed.
Correlation: A statistical measure of the linear relationship between two variables, ranging from -1 to +1.
CTE (Common Table Expression): A temporary named result set in SQL that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
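
The Correlation entry in this section can be made concrete with a short sketch that computes the Pearson coefficient directly from its definition. The function name `pearson_r` and the sample data are hypothetical.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: raises ZeroDivisionError if either input is constant
    return sxy / math.sqrt(sxx * syy)


ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
signups = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical data
r = pearson_r(ad_spend, signups)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship; it says nothing about causation or about non-linear patterns.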

D

DAU/MAU (Daily/Monthly Active Users): Metrics measuring the number of unique users who engage with a product within a day or month. DAU/MAU ratio indicates stickiness.
Degrees of Freedom: The number of independent values that can vary in a statistical calculation. Often n-1 for sample variance.
Distribution: A function showing all possible values of a variable and their frequencies or probabilities.

E

Effect Size: A quantitative measure of the magnitude of a phenomenon. Common measures include Cohen's d and odds ratio.
Expected Value: The long-run average value of a random variable over many repeated experiments. For a discrete variable, E(X) = Σ(x × P(x)).
Exponential Distribution: A probability distribution describing time between events in a Poisson process. Characterized by constant hazard rate.

F

F-statistic: A ratio of two variances used in ANOVA and regression to test if group means or model parameters differ significantly.
False Positive (Type I Error): Incorrectly rejecting the null hypothesis when it's actually true. Controlled by significance level α.
False Negative (Type II Error): Failing to reject the null hypothesis when it's actually false. Related to statistical power.
Feature Engineering: The process of creating new features from raw data to improve model performance.
Funnel Analysis: A method of analyzing the user journey through sequential steps, measuring conversion rates at each stage.

G

Gaussian Distribution: Another name for normal distribution. Characterized by mean (μ) and standard deviation (σ).
Guardrail Metrics: Metrics monitored during experiments to ensure changes don't negatively impact critical aspects of the product.
GROUP BY: SQL clause that groups rows with the same values in specified columns, often used with aggregate functions.

H

Hypothesis Testing: A statistical method for making decisions about population parameters based on sample data, comparing null and alternative hypotheses.
Heteroscedasticity: When the variance of residuals is not constant across all levels of independent variables in regression.

I-J

IQR (Interquartile Range): The difference between the 75th and 25th percentiles (Q3 - Q1). Used to identify outliers (values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR).
Imputation: The process of replacing missing data with substituted values using various strategies (mean, median, mode, or predictive methods).
JOIN: SQL operation combining rows from two or more tables based on a related column. Types: INNER, LEFT, RIGHT, FULL, CROSS.
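
The IQR outlier rule from this section can be sketched as follows. The function name `iqr_outliers` and the data are hypothetical. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, so other software may return slightly different fences.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


session_minutes = [10, 12, 12, 13, 14, 15, 15, 16, 18, 95]  # hypothetical data
outliers = iqr_outliers(session_minutes)
print(outliers)
```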

K

K-fold Cross Validation: A model validation technique that divides data into k subsets, training on k-1 folds and testing on the remaining fold, rotating k times.
KPI (Key Performance Indicator): A measurable value demonstrating how effectively a company is achieving key business objectives.
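
The rotation described under K-fold Cross Validation can be sketched with plain index lists. This is a minimal illustration, not a library implementation: the function name `k_fold_indices` is hypothetical, and it assumes no shuffling or stratification.

```python
from typing import Iterator


def k_fold_indices(n_samples: int, k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for each of k folds, without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print(f"train={train} test={test}")
```

Each observation appears in exactly one test fold, so every data point is used for evaluation once and for training k-1 times.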

L

Linear Regression: A statistical method modeling the relationship between a dependent variable and one or more independent variables using a linear equation.
Logistic Regression: A classification algorithm that models the probability of a binary outcome using the logistic function.
Log-Normal Distribution: A distribution of a variable whose logarithm follows a normal distribution. Common in revenue and engagement metrics.
LTV (Lifetime Value): The predicted total revenue a customer will generate throughout their relationship with a business.

M

Mean: The arithmetic average of a set of values. Sum of all values divided by the count. Sensitive to outliers.
Median: The middle value in a sorted dataset. More robust to outliers than mean.
Mode: The most frequently occurring value in a dataset. A distribution can have multiple modes.
Multicollinearity: When two or more independent variables in a regression model are highly correlated, making it difficult to isolate individual effects.
MLE (Maximum Likelihood Estimation): A method for estimating parameters by finding values that maximize the likelihood of observing the data.

N

Normal Distribution: A symmetric, bell-shaped probability distribution characterized by mean and standard deviation. 68-95-99.7 rule applies.
Null Hypothesis (H₀): The default assumption in hypothesis testing that there is no effect or difference. What we try to reject.
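NPS (Net Promoter Score): A customer loyalty metric calculated from survey responses asking likelihood to recommend (0-10). Computed as the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6), ranging from -100 to +100.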

O-P

Outlier: A data point that differs significantly from other observations. May indicate measurement error or genuinely unusual cases.
Overfitting: When a model learns noise in training data, performing well on training data but poorly on new data.
P-value: The probability of observing results at least as extreme as the actual results, assuming the null hypothesis is true.
Poisson Distribution: A distribution describing the probability of a given number of events occurring in a fixed interval when events occur independently and at a constant average rate.
Power (Statistical): The probability of correctly rejecting the null hypothesis when it's false (1 - β). Typically aim for 80%.
Precision: The proportion of positive predictions that are actually correct: TP / (TP + FP). High precision means few false positives.

Q-R

Quartile: Values that divide a dataset into four equal parts: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
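R-squared (R²): The proportion of variance in the dependent variable explained by the independent variables. Ranges from 0 to 1 for ordinary least squares with an intercept, evaluated on the training data; it can be negative on new data or for models without an intercept.
Recall (Sensitivity): The proportion of actual positives correctly identified: TP / (TP + FN). High recall means few false negatives.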
Regression: Statistical methods for modeling relationships between variables to make predictions or understand associations.
Retention Rate: The percentage of customers who continue using a product over a specific time period.
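
The Accuracy, Precision, and Recall formulas in this glossary all derive from the same four confusion-matrix counts. The sketch below computes them together; the function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Note: precision is undefined when tp + fp == 0, recall when tp + fn == 0;
    # this sketch does not guard against those cases.
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for a binary classifier evaluated on 100 cases
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```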

S

Sample Size: The number of observations in a sample. Larger samples provide more precise estimates and greater statistical power.
Standard Deviation (σ): A measure of the amount of variation in a dataset. The square root of variance.
Standard Error: The standard deviation of a sampling distribution. For means: SE = σ/√n.
Statistical Significance: When the p-value is less than the chosen significance level (α), indicating the observed result would be unlikely if the null hypothesis were true.
Subquery: A query nested inside another SQL query. Can be used in SELECT, FROM, WHERE, or HAVING clauses.

T

T-test: A statistical test comparing means. Types include one-sample, independent two-sample, and paired t-tests.
Type I Error: Rejecting the null hypothesis when it's true (false positive). Probability equals α.
Type II Error: Failing to reject the null hypothesis when it's false (false negative). Probability equals β.
Time Series: A sequence of data points indexed in time order, often analyzed for trends, seasonality, and patterns.

U-V

Underfitting: When a model is too simple to capture underlying patterns, performing poorly on both training and test data.
Variance: A measure of how spread out data points are from the mean. The average of squared deviations from the mean (dividing by n for a population, or by n-1 for a sample estimate).
VIF (Variance Inflation Factor): A measure of multicollinearity in regression. VIF > 10 typically indicates problematic multicollinearity.

W-Z

Welch's t-test: A variation of the t-test that doesn't assume equal variances between groups. More robust than Student's t-test when group variances or sample sizes differ.
Window Function: SQL functions that perform calculations across a set of rows related to the current row (e.g., ROW_NUMBER, RANK, LAG, LEAD).
Z-score: The number of standard deviations a data point is from the mean. Formula: z = (x - μ) / σ.
Z-test: A statistical test for comparing sample and population means when the population variance is known or sample size is large.
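
As an illustration of Welch's t-test from this section, the sketch below computes the test statistic and the Welch-Satterthwaite approximation to the degrees of freedom using only the standard library. The function name `welch_t` and the data are hypothetical. Converting the statistic to a p-value needs the t-distribution CDF, which the standard library does not provide.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Sample variances (n - 1 denominator), each divided by its group size
    se2_a = statistics.variance(a) / na
    se2_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


control = [5.1, 4.9, 5.6, 5.8, 6.0]         # hypothetical data
treatment = [4.2, 4.8, 4.4, 4.1, 4.6, 4.3]  # hypothetical data
t_stat, dof = welch_t(control, treatment)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.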

🔢 Quick Formulas Reference

Descriptive Statistics

Mean: μ = Σx / n
Variance: σ² = Σ(x - μ)² / n (population); s² = Σ(x - x̄)² / (n - 1) (sample)
Standard Deviation: σ = √(σ²)
Standard Error: SE = σ / √n
Coefficient of Variation: CV = (σ / μ) × 100%
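
The descriptive formulas above can be checked against a small dataset with Python's `statistics` module. The data are hypothetical, and the population forms (dividing by n) are used to match the formulas as written.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
n = len(data)

mean = statistics.fmean(data)             # Σx / n
variance = statistics.pvariance(data)     # population variance: Σ(x - μ)² / n
std_dev = statistics.pstdev(data)         # square root of the population variance
std_error = std_dev / math.sqrt(n)        # σ / √n
cv_percent = std_dev / mean * 100         # (σ / μ) × 100%
sample_var = statistics.variance(data)    # sample variance uses n - 1 instead

print(f"mean={mean}, variance={variance}, sd={std_dev}")
print(f"SE={std_error:.4f}, CV={cv_percent:.1f}%, sample variance={sample_var:.4f}")
```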

Probability

Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)

Hypothesis Testing

Z-statistic: z = (x̄ - μ) / (σ / √n)
T-statistic: t = (x̄ - μ) / (s / √n)
Confidence Interval: x̄ ± z* × (σ / √n), where z* is the critical value for the chosen confidence level (about 1.96 for 95%)
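
The z-statistic and confidence interval formulas above can be combined in one short sketch. The function name `z_test_and_ci` and the inputs are hypothetical, and the sketch assumes the population standard deviation σ is known.

```python
import math
from statistics import NormalDist


def z_test_and_ci(sample_mean: float, mu0: float, sigma: float, n: int,
                  confidence: float = 0.95) -> tuple[float, float, tuple[float, float]]:
    """Return (z statistic, two-sided p-value, confidence interval for the mean)."""
    se = sigma / math.sqrt(n)                      # standard error: σ / √n
    z = (sample_mean - mu0) / se                   # z = (x̄ - μ) / (σ / √n)
    std_normal = NormalDist()                      # standard normal: mean 0, sd 1
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)
    return z, p_two_sided, ci


# Hypothetical inputs: sample mean 52 from n = 100, null mean 50, known σ = 10
z, p_value, ci = z_test_and_ci(sample_mean=52, mu0=50, sigma=10, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```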

Sample Size (for proportions)

Formula: n = (z² × p × (1-p)) / E²
Where z is the critical value for the chosen confidence level, p is the expected proportion, and E is the margin of error
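
The sample size formula above can be sketched as a small helper. The function name `sample_size_for_proportion` is hypothetical. It assumes a two-sided interval and rounds up, since a fractional observation is not possible.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p: float, margin_of_error: float,
                               confidence: float = 0.95) -> int:
    """Return the minimum n so a proportion estimate has the given margin of error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # n = z² × p × (1 - p) / E², rounded up to a whole observation
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


# p = 0.5 is the most conservative choice because it maximizes p × (1 - p)
n_required = sample_size_for_proportion(p=0.5, margin_of_error=0.05)
print(n_required)
```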