Appendix

Glossary, cheatsheets, and quick reference materials for data science interviews.

📖 Glossary of Terms

Quick reference definitions for key concepts used throughout the handbook.

A

A/B Testing
A randomized controlled experiment comparing two versions (A and B) to determine which performs better on a defined metric. Also called split testing.
Accuracy
The proportion of correct predictions (true positives + true negatives) among the total number of cases examined. Formula: (TP + TN) / (TP + TN + FP + FN)
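For instance, with hypothetical confusion-matrix counts, accuracy (and the related precision and recall, defined later in this glossary) can be computed directly:

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 80, 90, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (80 + 90) / 200 = 0.85
precision = tp / (tp + fp)                  # 80 / 90  ≈ 0.889
recall = tp / (tp + fn)                     # 80 / 100 = 0.80
print(accuracy, precision, recall)
```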
Alpha (α)
The significance level in hypothesis testing; the probability of rejecting the null hypothesis when it's actually true (Type I error rate). Commonly set at 0.05.
ANOVA (Analysis of Variance)
A statistical method for comparing means of three or more groups to determine if at least one group mean differs significantly from others.
ARPU (Average Revenue Per User)
A metric measuring the average revenue generated per user, commonly used in subscription and freemium business models.

B

Bayes' Theorem
A formula for calculating conditional probabilities: P(A|B) = P(B|A) × P(A) / P(B). Used for updating beliefs based on new evidence.
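A worked sketch with assumed numbers (1% prevalence, 95% sensitivity, 5% false-positive rate), a classic interview setup:

```python
# P(disease | positive test) via Bayes' theorem, using assumed illustrative rates
p_d = 0.01             # P(disease)
p_pos_given_d = 0.95   # P(positive | disease)
p_pos_given_nd = 0.05  # P(positive | no disease)

p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)  # law of total probability
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ≈ 0.161 — far lower than the test's 95% sensitivity
```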
Beta (β)
In hypothesis testing, the probability of failing to reject the null hypothesis when it's actually false (Type II error rate). Statistical power = 1 - β.
Bias
Systematic error that causes results to deviate from the true value in a consistent direction. In ML, high bias comes from overly simple models and shows up as underfitting.
Binomial Distribution
A probability distribution for the number of successes in n independent trials, each with probability p of success.
Bonferroni Correction
A method to adjust significance levels when performing multiple comparisons, dividing α by the number of tests to control family-wise error rate.
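A minimal sketch with an assumed family of 10 tests:

```python
# Bonferroni-adjusted significance threshold for m comparisons (values assumed)
alpha = 0.05
m = 10                      # number of comparisons in the family
adjusted_alpha = alpha / m  # 0.005

p_values = [0.001, 0.004, 0.03, 0.2]  # hypothetical p-values
significant = [p < adjusted_alpha for p in p_values]
print(significant)  # [True, True, False, False]
```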

C

Central Limit Theorem (CLT)
States that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution.
Chi-Square Test
A statistical test for categorical data to assess whether observed frequencies differ from expected frequencies.
Cohort Analysis
A type of analysis that groups users by shared characteristics (often acquisition date) to track behavior over time.
Confidence Interval
A range of values computed from sample data; the procedure is constructed so that a specified proportion of such intervals (e.g., 95%) would contain the true population parameter under repeated sampling.
Confounding Variable
A variable that influences both the dependent and independent variables, potentially creating a spurious association.
Conversion Rate
The percentage of users who complete a desired action (e.g., sign up, purchase) out of total users exposed.
Correlation
A statistical measure of the linear relationship between two variables, ranging from -1 to +1.
CTE (Common Table Expression)
A temporary named result set in SQL that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.

D

DAU/MAU (Daily/Monthly Active Users)
Metrics measuring the number of unique users who engage with a product within a day or month. DAU/MAU ratio indicates stickiness.
Degrees of Freedom
The number of independent values that can vary in a statistical calculation. Often n-1 for sample variance.
Distribution
A function showing all possible values of a variable and their frequencies or probabilities.

E

Effect Size
A quantitative measure of the magnitude of a phenomenon. Common measures include Cohen's d and odds ratio.
Expected Value
The long-run average value of a random variable over many repeated experiments. E(X) = Σ(x × P(x)).
Exponential Distribution
A probability distribution describing time between events in a Poisson process. Characterized by constant hazard rate.

F

F-statistic
A ratio of two variances used in ANOVA and regression to test if group means or model parameters differ significantly.
False Positive (Type I Error)
Incorrectly rejecting the null hypothesis when it's actually true. Controlled by significance level α.
False Negative (Type II Error)
Failing to reject the null hypothesis when it's actually false. Related to statistical power.
Feature Engineering
The process of creating new features from raw data to improve model performance.
Funnel Analysis
A method of analyzing the user journey through sequential steps, measuring conversion rates at each stage.

G

Gaussian Distribution
Another name for normal distribution. Characterized by mean (μ) and standard deviation (σ).
Guardrail Metrics
Metrics monitored during experiments to ensure changes don't negatively impact critical aspects of the product.
GROUP BY
SQL clause that groups rows with the same values in specified columns, often used with aggregate functions.

H

Heteroscedasticity
When the variance of the residuals is not constant across levels of the independent variables in a regression model.
Hypothesis Testing
A statistical method for making decisions about population parameters based on sample data, comparing null and alternative hypotheses.

I-J

IQR (Interquartile Range)
The range between the 25th and 75th percentiles. Used to identify outliers (values beyond 1.5×IQR from quartiles).
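A short sketch of the 1.5×IQR fence on a made-up sample (uses NumPy):

```python
import numpy as np

# Small hypothetical sample with one extreme value
data = np.array([2, 4, 5, 5, 6, 7, 8, 9, 30])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier fences
outliers = data[(data < lower) | (data > upper)]
print(iqr, lower, upper, outliers)  # 3.0 0.5 12.5 [30]
```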
Imputation
The process of replacing missing data with substituted values using various strategies (mean, median, mode, or predictive methods).
JOIN
SQL operation combining rows from two or more tables based on a related column. Types: INNER, LEFT, RIGHT, FULL, CROSS.

K

K-fold Cross Validation
A model validation technique that divides data into k subsets, training on k-1 folds and testing on the remaining fold, rotating k times.
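A minimal 5-fold example, assuming scikit-learn is available:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, rotate
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```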
KPI (Key Performance Indicator)
A measurable value demonstrating how effectively a company is achieving key business objectives.

L

Linear Regression
A statistical method modeling the relationship between a dependent variable and one or more independent variables using a linear equation.
Logistic Regression
A classification algorithm that models the probability of a binary outcome using the logistic function.
Log-Normal Distribution
A distribution of a variable whose logarithm follows a normal distribution. Common in revenue and engagement metrics.
LTV (Lifetime Value)
The predicted total revenue a customer will generate throughout their relationship with a business.

M

Mean
The arithmetic average of a set of values. Sum of all values divided by the count. Sensitive to outliers.
Median
The middle value in a sorted dataset. More robust to outliers than mean.
Mode
The most frequently occurring value in a dataset. A distribution can have multiple modes.
Multicollinearity
When two or more independent variables in a regression model are highly correlated, making it difficult to isolate individual effects.
MLE (Maximum Likelihood Estimation)
A method for estimating parameters by finding values that maximize the likelihood of observing the data.

N

Normal Distribution
A symmetric, bell-shaped probability distribution characterized by mean and standard deviation. 68-95-99.7 rule applies.
Null Hypothesis (H₀)
The default assumption in hypothesis testing that there is no effect or difference. What we try to reject.
NPS (Net Promoter Score)
A customer loyalty metric based on a 0-10 "likelihood to recommend" survey question, calculated as the percentage of promoters (9-10) minus the percentage of detractors (0-6); it ranges from -100 to +100.

O-P

Outlier
A data point that differs significantly from other observations. May indicate measurement error or genuinely unusual cases.
Overfitting
When a model learns noise in training data, performing well on training data but poorly on new data.
P-value
The probability of observing results at least as extreme as the actual results, assuming the null hypothesis is true.
Poisson Distribution
A distribution describing the probability of a given number of events occurring in a fixed interval when events occur at a constant rate.
Power (Statistical)
The probability of correctly rejecting the null hypothesis when it's false (1 - β). Typically aim for 80%.
Precision
The proportion of positive predictions that are actually correct: TP / (TP + FP). High precision means few false positives.

Q-R

Quartile
Values that divide a dataset into four equal parts: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
R-squared (R²)
The proportion of variance in the dependent variable explained by the independent variables. Ranges from 0 to 1.
Recall (Sensitivity)
The proportion of actual positives correctly identified: TP / (TP + FN). High recall means few false negatives.
Regression
Statistical methods for modeling relationships between variables to make predictions or understand associations.
Retention Rate
The percentage of customers who continue using a product over a specific time period.

S

Sample Size
The number of observations in a sample. Larger samples provide more precise estimates and greater statistical power.
Standard Deviation (σ)
A measure of the amount of variation in a dataset. The square root of variance.
Standard Error
The standard deviation of a sampling distribution. For means: SE = σ/√n.
Statistical Significance
When the p-value is less than the chosen significance level (α), indicating the result is unlikely due to chance alone.
Subquery
A query nested inside another SQL query. Can be used in SELECT, FROM, WHERE, or HAVING clauses.

T

T-test
A statistical test comparing means. Types include one-sample, independent two-sample, and paired t-tests.
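A quick sketch of an independent two-sample t-test on simulated data with SciPy; the `equal_var=False` variant is Welch's t-test (see the W-Z section):

```python
import numpy as np
from scipy import stats

# Simulated samples with different means and variances (values assumed)
rng = np.random.default_rng(42)
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=11.0, scale=3.0, size=60)

t_student, p_student = stats.ttest_ind(a, b)               # assumes equal variances
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
print(p_student, p_welch)
```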
Type I Error
Rejecting the null hypothesis when it's true (false positive). Probability equals α.
Type II Error
Failing to reject the null hypothesis when it's false (false negative). Probability equals β.
Time Series
A sequence of data points indexed in time order, often analyzed for trends, seasonality, and patterns.

U-V

Underfitting
When a model is too simple to capture underlying patterns, performing poorly on both training and test data.
Variance
A measure of how spread out data points are from the mean. The average of squared deviations from the mean.
VIF (Variance Inflation Factor)
A measure of multicollinearity in regression. VIF > 10 typically indicates problematic multicollinearity.

W-Z

Welch's t-test
A variation of the t-test that doesn't assume equal variances between groups. More robust than Student's t-test.
Window Function
SQL functions that perform calculations across a set of rows related to the current row (e.g., ROW_NUMBER, RANK, LAG, LEAD).
Z-score
The number of standard deviations a data point is from the mean. Formula: z = (x - μ) / σ.
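A one-line worked example with assumed values:

```python
# Standardizing a single value: distance from the mean in standard deviations
x, mu, sigma = 130, 100, 15   # hypothetical IQ-style scale
z = (x - mu) / sigma
print(z)                      # 2.0 standard deviations above the mean
```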
Z-test
A statistical test for comparing sample and population means when the population variance is known or sample size is large.

🔢 Quick Formulas Reference

Descriptive Statistics

  • Mean: μ = Σx / n
  • Variance: σ² = Σ(x - μ)² / n
  • Standard Deviation: σ = √(σ²)
  • Standard Error: SE = σ / √n
  • Coefficient of Variation: CV = (σ / μ) × 100%
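These formulas (population form, divisor n) can be checked on a small made-up sample:

```python
import math

x = [4, 8, 6, 5, 3, 7]        # hypothetical sample
n = len(x)
mean = sum(x) / n
variance = sum((xi - mean) ** 2 for xi in x) / n
std_dev = math.sqrt(variance)
std_error = std_dev / math.sqrt(n)
cv = std_dev / mean * 100
print(mean, variance, std_dev, std_error, cv)
# 5.5  ≈2.917  ≈1.708  ≈0.697  ≈31.05%
```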

Probability

  • Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
  • Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  • Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)

Hypothesis Testing

  • Z-statistic: z = (x̄ - μ) / (σ / √n)
  • T-statistic: t = (x̄ - μ) / (s / √n)
  • Confidence Interval: x̄ ± z* × (σ / √n)
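A worked sketch of the z-statistic and a 95% confidence interval with assumed values:

```python
import math

# Assumed values for illustration: sample mean, hypothesized mean, σ, n
x_bar, mu_0, sigma, n = 52.0, 50.0, 8.0, 100

z = (x_bar - mu_0) / (sigma / math.sqrt(n))     # 2.5
z_crit = 1.96                                   # two-sided 95% critical value
ci = (x_bar - z_crit * sigma / math.sqrt(n),
      x_bar + z_crit * sigma / math.sqrt(n))    # (50.432, 53.568)
print(z, ci)
```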

Sample Size (for proportions)

  • Formula: n = (z² × p × (1-p)) / E²
  • Where E is the desired margin of error (see the worked example below)
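A worked example with assumed inputs (95% confidence, p = 0.5, ±3% margin of error):

```python
import math

z = 1.96    # critical value for 95% confidence
p = 0.5     # most conservative assumption for the true proportion
E = 0.03    # desired margin of error

n = (z ** 2 * p * (1 - p)) / E ** 2
print(math.ceil(n))   # 1068 respondents per group
```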