Appendix
Glossary, cheatsheets, and quick reference materials for data science interviews.
📖 Glossary of Terms
Quick reference definitions for key concepts used throughout the handbook.
A
- A/B Testing
- A randomized controlled experiment comparing two versions (A and B) to determine which performs better on a defined metric. Also called split testing.
- Accuracy
- The proportion of correct predictions (true positives + true negatives) among the total number of cases examined. Formula: (TP + TN) / (TP + TN + FP + FN)
- Alpha (α)
- The significance level in hypothesis testing; the probability of rejecting the null hypothesis when it's actually true (Type I error rate). Commonly set at 0.05.
- ANOVA (Analysis of Variance)
- A statistical method for comparing means of three or more groups to determine if at least one group mean differs significantly from others.
- ARPU (Average Revenue Per User)
- A metric measuring the average revenue generated per user, commonly used in subscription and freemium business models.
B
- Bayes' Theorem
- A formula for calculating conditional probabilities: P(A|B) = P(B|A) × P(A) / P(B). Used for updating beliefs based on new evidence.
- Beta (β)
- In hypothesis testing, the probability of failing to reject the null hypothesis when it's actually false (Type II error rate). Statistical power = 1 - β.
- Binomial Distribution
- A probability distribution for the number of successes in n independent trials, each with probability p of success.
- Bias
- Systematic error that causes results to deviate from the true value in a consistent direction. In ML, high bias is error from overly simple model assumptions, which leads to underfitting.
- Bonferroni Correction
- A method to adjust significance levels when performing multiple comparisons, dividing α by the number of tests to control family-wise error rate.
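As a rough illustration of the Bonferroni correction described above, the sketch below divides α by the number of tests and keeps only the p-values that fall under the adjusted threshold. The function name `bonferroni_alpha` and the p-values are invented for the example.

```python
def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# Hypothetical p-values from five simultaneous comparisons.
p_values = [0.003, 0.012, 0.04, 0.2, 0.009]

threshold = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 5
significant = [p for p in p_values if p < threshold]
print(threshold, significant)
```

Only the p-values below the adjusted threshold count as significant, so two results that would pass at 0.05 (0.012 and 0.04) no longer do.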
C
- Central Limit Theorem (CLT)
- States that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance.
- Chi-Square Test
- A statistical test for categorical data to assess whether observed frequencies differ from expected frequencies.
- Cohort Analysis
- A type of analysis that groups users by shared characteristics (often acquisition date) to track behavior over time.
- Confidence Interval
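- A range of values, computed from sample data, constructed so that it contains the true population parameter in a specified share of repeated samples (e.g., 95% for a 95% CI). The confidence level describes the procedure, not the probability that any one computed interval contains the parameter.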
- Confounding Variable
- A variable that influences both the dependent and independent variables, potentially creating a spurious association.
- Conversion Rate
- The percentage of users who complete a desired action (e.g., sign up, purchase) out of total users exposed.
- Correlation
- A statistical measure of the linear relationship between two variables, ranging from -1 to +1.
- CTE (Common Table Expression)
- A temporary named result set in SQL that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
D
- DAU/MAU (Daily/Monthly Active Users)
- Metrics measuring the number of unique users who engage with a product within a day or month. DAU/MAU ratio indicates stickiness.
- Degrees of Freedom
- The number of independent values that can vary in a statistical calculation. Often n-1 for sample variance.
- Distribution
- A function showing all possible values of a variable and their frequencies or probabilities.
E
- Effect Size
- A quantitative measure of the magnitude of a phenomenon. Common measures include Cohen's d and odds ratio.
- Expected Value
- The long-run average value of a random variable over many repeated experiments. E(X) = Σ(x × P(x)).
- Exponential Distribution
- A probability distribution describing time between events in a Poisson process. Characterized by constant hazard rate.
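The expected value formula E(X) = Σ(x × P(x)) from this section can be checked with a short sketch. It assumes a discrete distribution given as a value-to-probability mapping; the helper name `expected_value` and the fair-die example are illustrative only.

```python
from fractions import Fraction


def expected_value(distribution):
    """E(X) = sum of x * P(x) for a discrete distribution {value: probability}."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())


# A fair six-sided die: each face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
ev = expected_value(fair_die)
print(ev)  # 7/2
```

Exact fractions are used here so the probability check is not affected by floating-point rounding.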
F
- F-statistic
- A ratio of two variances used in ANOVA and regression to test if group means or model parameters differ significantly.
- False Positive (Type I Error)
- Incorrectly rejecting the null hypothesis when it's actually true. Controlled by significance level α.
- False Negative (Type II Error)
- Failing to reject the null hypothesis when it's actually false. Related to statistical power.
- Feature Engineering
- The process of creating new features from raw data to improve model performance.
- Funnel Analysis
- A method of analyzing the user journey through sequential steps, measuring conversion rates at each stage.
G
- Gaussian Distribution
- Another name for normal distribution. Characterized by mean (μ) and standard deviation (σ).
- Guardrail Metrics
- Metrics monitored during experiments to ensure changes don't negatively impact critical aspects of the product.
- GROUP BY
- SQL clause that groups rows with the same values in specified columns, often used with aggregate functions.
H
- Hypothesis Testing
- A statistical method for making decisions about population parameters based on sample data, comparing null and alternative hypotheses.
- Heteroscedasticity
- When the variance of residuals is not constant across all levels of independent variables in regression.
I-J
- IQR (Interquartile Range)
- The difference between the 75th and 25th percentiles (Q3 - Q1). Used to identify outliers (values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR).
- Imputation
- The process of replacing missing data with substituted values using various strategies (mean, median, mode, or predictive methods).
- JOIN
- SQL operation combining rows from two or more tables based on a related column. Types: INNER, LEFT, RIGHT, FULL, CROSS.
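The 1.5×IQR outlier rule from the IQR entry in this section can be sketched as follows. Quartile conventions differ between tools; this version uses the default method of Python's `statistics.quantiles`, so the exact fences may differ slightly from those of other libraries. The function name and sample data are invented for the example.

```python
import statistics


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


sample = [10, 12, 13, 14, 15, 16, 18, 95]  # hypothetical measurements
outliers = iqr_outliers(sample)
print(outliers)  # [95]
```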
K
- K-fold Cross Validation
- A model validation technique that divides data into k subsets, training on k-1 folds and testing on the remaining fold, rotating k times.
- KPI (Key Performance Indicator)
- A measurable value demonstrating how effectively a company is achieving key business objectives.
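The rotation described in the K-fold Cross Validation entry above can be sketched with plain index lists. This is a minimal, unshuffled version for illustration; in practice the data is usually shuffled first, and the generator name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Each observation lands in exactly one test fold, and the model is trained k times on the remaining k-1 folds.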
L
- Linear Regression
- A statistical method modeling the relationship between a dependent variable and one or more independent variables using a linear equation.
- Logistic Regression
- A classification algorithm that models the probability of a binary outcome using the logistic function.
- Log-Normal Distribution
- A distribution of a variable whose logarithm follows a normal distribution. Common in revenue and engagement metrics.
- LTV (Lifetime Value)
- The predicted total revenue a customer will generate throughout their relationship with a business.
M
- Mean
- The arithmetic average of a set of values. Sum of all values divided by the count. Sensitive to outliers.
- Median
- The middle value in a sorted dataset. More robust to outliers than mean.
- Mode
- The most frequently occurring value in a dataset. A distribution can have multiple modes.
- Multicollinearity
- When two or more independent variables in a regression model are highly correlated, making it difficult to isolate individual effects.
- MLE (Maximum Likelihood Estimation)
- A method for estimating parameters by finding values that maximize the likelihood of observing the data.
N
- Normal Distribution
- A symmetric, bell-shaped probability distribution characterized by mean and standard deviation. About 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean (the 68-95-99.7 rule).
- Null Hypothesis (H₀)
- The default assumption in hypothesis testing that there is no effect or difference. What we try to reject.
- NPS (Net Promoter Score)
- A customer loyalty metric calculated from survey responses asking likelihood to recommend (0-10), ranging from -100 to +100.
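The NPS entry above gives the scale and range but not the calculation. Under the usual convention (not spelled out in the glossary), respondents scoring 9-10 are promoters and those scoring 0-6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors. A small sketch with invented survey responses:

```python
def net_promoter_score(scores):
    """NPS = % promoters (9-10) minus % detractors (0-6), from 0-10 ratings."""
    if not scores:
        raise ValueError("scores must not be empty")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


responses = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]  # hypothetical survey answers
nps = net_promoter_score(responses)
print(nps)  # 20.0
```

Scores of 7 and 8 (passives) count toward the total but toward neither group, which is why the result stays between -100 and +100.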
O-P
- Outlier
- A data point that differs significantly from other observations. May indicate measurement error or genuinely unusual cases.
- Overfitting
- When a model learns noise in training data, performing well on training data but poorly on new data.
- P-value
- The probability of observing results at least as extreme as the actual results, assuming the null hypothesis is true.
- Poisson Distribution
- A distribution describing the probability of a given number of events occurring in a fixed interval when events occur independently at a constant average rate.
- Power (Statistical)
- The probability of correctly rejecting the null hypothesis when it's false (1 - β). Typically aim for 80%.
- Precision
- The proportion of positive predictions that are actually correct: TP / (TP + FP). High precision means few false positives.
Q-R
- Quartile
- Values that divide a dataset into four equal parts: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
- R-squared (R²)
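- The proportion of variance in the dependent variable explained by the independent variables. Ranges from 0 to 1 for ordinary least squares with an intercept, evaluated on the training data; it can be negative on held-out data or for models fit without an intercept.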
- Recall (Sensitivity)
- The proportion of actual positives correctly identified: TP / (TP + FN). High recall means few false negatives.
- Regression
- Statistical methods for modeling relationships between variables to make predictions or understand associations.
- Retention Rate
- The percentage of customers who continue using a product over a specific time period.
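The accuracy, precision, and recall formulas given in this glossary can be compared side by side on one confusion matrix. The counts below are invented, and the function name is only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0 or tp + fp == 0 or tp + fn == 0:
        raise ValueError("counts leave at least one metric undefined")
    return {
        "accuracy": (tp + tn) / total,   # (TP + TN) / (TP + TN + FP + FN)
        "precision": tp / (tp + fp),     # TP / (TP + FP)
        "recall": tp / (tp + fn),        # TP / (TP + FN)
    }


# Hypothetical classifier results on 100 cases.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, accuracy is 0.70, precision 0.80, and recall about 0.67, which shows that the three metrics can diverge on the same predictions.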
S
- Sample Size
- The number of observations in a sample. Larger samples provide more precise estimates and greater statistical power.
- Standard Deviation (σ)
- A measure of the amount of variation in a dataset. The square root of variance.
- Standard Error
- The standard deviation of a sampling distribution. For means: SE = σ/√n.
- Statistical Significance
- When the p-value is less than the chosen significance level (α), indicating that a result this extreme would be unlikely if the null hypothesis were true.
- Subquery
- A query nested inside another SQL query. Can be used in SELECT, FROM, WHERE, or HAVING clauses.
T
- T-test
- A statistical test comparing means. Types include one-sample, independent two-sample, and paired t-tests.
- Type I Error
- Rejecting the null hypothesis when it's true (false positive). Probability equals α.
- Type II Error
- Failing to reject the null hypothesis when it's false (false negative). Probability equals β.
- Time Series
- A sequence of data points indexed in time order, often analyzed for trends, seasonality, and patterns.
U-V
- Underfitting
- When a model is too simple to capture underlying patterns, performing poorly on both training and test data.
- Variance
- A measure of how spread out data points are from the mean. The average of squared deviations from the mean.
- VIF (Variance Inflation Factor)
- A measure of multicollinearity in regression. VIF > 10 typically indicates problematic multicollinearity.
W-Z
- Welch's t-test
- A variation of the t-test that doesn't assume equal variances between groups. More reliable than Student's t-test when group variances or sample sizes differ.
- Window Function
- SQL functions that perform calculations across a set of rows related to the current row (e.g., ROW_NUMBER, RANK, LAG, LEAD).
- Z-score
- The number of standard deviations a data point is from the mean. Formula: z = (x - μ) / σ.
- Z-test
- A statistical test for comparing sample and population means when the population variance is known or sample size is large.
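To make the Welch's t-test entry in this section concrete, the sketch below computes the test statistic and the approximate (Welch-Satterthwaite) degrees of freedom from two samples using only the standard library. It stops short of a p-value, which needs the t distribution (for example from SciPy). The function name and data are assumptions for the example.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Per-group squared standard errors; variances are NOT pooled.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    mean_diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t = mean_diff / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 3), round(dof, 2))  # -1.897 5.88
```

The degrees of freedom come out below the pooled value of 8 because the two groups have unequal variances.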
🔢 Quick Formulas Reference
Descriptive Statistics
- Mean: μ = Σx / n
- Variance: σ² = Σ(x - μ)² / n (population); s² = Σ(x - x̄)² / (n - 1) (sample)
- Standard Deviation: σ = √(σ²)
- Standard Error: SE = σ / √n
- Coefficient of Variation: CV = (σ / μ) × 100%
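The descriptive statistics formulas above map directly onto Python's `statistics` module. The sketch below uses a small made-up dataset and treats it as a full population for σ, with the sample variance shown for contrast.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
n = len(data)

mean = statistics.fmean(data)                # sum(x) / n
pop_variance = statistics.pvariance(data)    # divides by n
sample_variance = statistics.variance(data)  # divides by n - 1
pop_sd = math.sqrt(pop_variance)             # square root of the variance
standard_error = pop_sd / math.sqrt(n)       # sigma / sqrt(n)
cv_percent = pop_sd / mean * 100             # (sigma / mu) * 100%

print(mean, pop_variance, pop_sd, cv_percent)  # 5.0 4.0 2.0 40.0
```

The sample variance (about 4.57) is larger than the population variance (4.0) because it divides by n - 1 rather than n.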
Probability
- Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
- Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)
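A worked example of Bayes' Theorem, using a classic screening-test setup: A is "has the condition" and B is "tests positive". The prevalence, sensitivity, and false positive rate are invented numbers, and P(B) is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) from total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return likelihood * prior / evidence


# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_condition_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_condition_given_positive, 3))  # 0.161
```

Even with a fairly accurate test, a positive result implies only about a 16% chance of having the condition, because the condition is rare.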
Hypothesis Testing
- Z-statistic: z = (x̄ - μ) / (σ / √n)
- T-statistic: t = (x̄ - μ) / (s / √n)
- Confidence Interval: x̄ ± z* × (σ / √n), where z* is the critical value for the chosen confidence level (about 1.96 for 95%)
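The z-statistic and confidence interval formulas above can be evaluated with the standard library, assuming the population standard deviation σ is known. The function names and the numbers are illustrative.

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """x_bar +/- z* * (sigma / sqrt(n)) for a two-sided interval."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical study: sample mean 52, null mean 50, sigma 10, n = 100.
z = z_statistic(52, 50, 10, 100)
low, high = z_confidence_interval(52, 10, 100)
print(round(z, 2), round(low, 2), round(high, 2))  # 2.0 50.04 53.96
```

The interval just excludes the null mean of 50, consistent with z = 2.0 exceeding the 1.96 critical value.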
Sample Size (for proportions)
- Formula: n = (z² × p × (1-p)) / E²
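- Where z is the critical value for the chosen confidence level, p is the expected proportion (p = 0.5 gives the most conservative, i.e. largest, n), and E is the margin of error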
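Plugging numbers into the sample size formula for proportions gives a feel for the scale involved. This sketch assumes a two-sided interval and rounds up to a whole observation; the function name and defaults are not from the handbook.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, p=0.5, confidence=0.95):
    """n = z^2 * p * (1 - p) / E^2, rounded up to a whole observation."""
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


n_5pct = sample_size_for_proportion(0.05)  # +/- 5 percentage points
n_3pct = sample_size_for_proportion(0.03)  # +/- 3 percentage points
print(n_5pct, n_3pct)  # 385 1068
```

Because E is squared in the denominator, tightening the margin from 5 to 3 points nearly triples the required sample.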