Programming (Python/R) for Data Analysis
Python fundamentals, Pandas, NumPy, visualization, and optional modeling libraries.
Overview
While SQL is often the primary focus in data science interviews, demonstrating proficiency in Python or R is crucial for data manipulation, analysis, and visualization. You may be asked to write code snippets, explain code logic, or discuss how you would approach a data-related problem using these languages. Expect questions involving:
- Core Libraries: Pandas, NumPy, data visualization libraries (Matplotlib, Seaborn)
- Potentially: Statistical modeling libraries (Statsmodels, scikit-learn), and other specialized libraries depending on the role (e.g., NLP libraries like NLTK or spaCy).
Interview-Grade Python: What "Good" Looks Like
- Correctness first: handle missing values, duplicate keys, timezone/date quirks, and edge cases.
- Readable & modular: small functions, clear variable names, explicit assumptions.
- Vectorized when it matters: prefer pandas operations over Python loops for large data (a quick comparison follows this list).
- Reproducible: fixed seeds, deterministic sorts, and saved intermediate outputs when useful.
- Communicates insight: code answers a question; narrative explains why the result matters.
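To make the vectorization point concrete, here is a minimal sketch comparing a row-wise apply with the equivalent vectorized operation. The DataFrame and column name are invented for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(0).uniform(1, 100, 100_000)})

# Slow: row-wise Python function applied to every value
slow = df["price"].apply(lambda x: x * 1.2)

# Fast: one vectorized operation on the whole column
fast = df["price"] * 1.2

assert np.allclose(slow, fast)

Both produce the same result; the vectorized version pushes the loop into compiled code, which is what interviewers usually want to see for large tables.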
How to Prep
- Practice Regularly: Consistent practice is key. Work through coding exercises on platforms like HackerRank, LeetCode, and StrataScratch, focusing on data manipulation and analysis problems.
- Focus on Fundamentals: Ensure you have a solid understanding of data structures (lists, dictionaries, arrays), control flow (loops, conditional statements), and functions.
- Master Core Libraries: Become proficient in using Pandas for data manipulation, NumPy for numerical operations, and Matplotlib/Seaborn for visualization.
- Understand Data Cleaning and Transformation: Practice techniques for handling missing values, data type conversions, and data aggregation.
- Be Comfortable Explaining Your Code: Be prepared to walk through your code line by line, explaining the logic and reasoning behind your choices. Consider time and space complexity of your solutions.
- Think About Edge Cases: When designing solutions, consider potential edge cases and how your code handles them.
Resources
Key Libraries and Functionalities
Pandas
- DataFrames: Two-dimensional labeled data structures with columns of potentially different types.
- Series: One-dimensional labeled array.
- Data Cleaning: Handling missing values (e.g., df.fillna(), df.dropna()), removing duplicates (df.drop_duplicates()).
- Data Transformation: Filtering (df[df['Age'] > 25]), sorting (df.sort_values('Age')), adding/removing columns.
- Data Aggregation: Grouping data and applying aggregate functions (e.g., df.groupby('City')['Age'].mean()).
- Reading and Writing Data: Reading data from CSV, Excel, and other formats (pd.read_csv(), pd.read_excel()).
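A minimal sketch that combines these operations on a toy table (the column names are illustrative, not from a real dataset):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Diana"],
    "Age": [25, 30, 30, None],
    "City": ["New York", "London", "London", "Paris"],
})

df = df.drop_duplicates()                       # cleaning: remove duplicate rows
df["Age"] = df["Age"].fillna(df["Age"].mean())  # cleaning: impute missing age with the mean
adults = df[df["Age"] > 25].sort_values("Age")  # transformation: filter and sort
mean_age = df.groupby("City")["Age"].mean()     # aggregation: mean age per city
print(adults)
print(mean_age)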
NumPy
- Arrays: N-dimensional arrays for efficient numerical operations.
- Mathematical Functions: Performing mathematical operations on arrays (e.g., np.mean(), np.std(), np.sum()).
- Linear Algebra: Matrix operations, dot products, etc.
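For the linear algebra bullet, a small sketch of the operations worth having at your fingertips (the matrices are made up for illustration):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

print(A @ b)             # matrix-vector product
print(A.T)               # transpose
print(np.dot(b, b))      # dot product
print(np.linalg.inv(A))  # inverse (A is invertible here)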
Matplotlib/Seaborn
- Matplotlib: Creating basic plots like line plots, scatter plots, bar charts, histograms.
- Seaborn: Building on top of Matplotlib to create more visually appealing and informative statistical graphics.
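A short plotting sketch showing the two libraries side by side, assuming Matplotlib and Seaborn are installed; the data is synthetic:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(values, bins=30)               # plain Matplotlib histogram
axes[0].set_title("Matplotlib hist")
sns.histplot(values, kde=True, ax=axes[1])  # Seaborn adds a KDE overlay and nicer defaults
axes[1].set_title("Seaborn histplot")
plt.tight_layout()
plt.show()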
Core Pandas Patterns (Most Frequently Tested)
- Join/merge: pd.merge(left, right, on=..., how='left') + verify key uniqueness.
- Groupby aggregates: groupby(...).agg(...) for multi-metric tables.
- Reshape: pivot_table, melt, wide-to-long and long-to-wide.
- Time series: .dt accessors, resampling, rolling windows.
- Conditional logic: np.where, boolean masks, and DataFrame.assign.
- Safe counts: nunique, value_counts(dropna=False), and explicit denominators.
Key-check pattern (prevents silent join explosions):
# Fail fast if the merge key is not unique; prevents silent row multiplication
def assert_unique(df, cols, name="df"):
dupes = df.duplicated(cols).sum()
if dupes:
raise ValueError(f"{name}: {dupes} duplicate keys on {cols}")
assert_unique(users, ["user_id"], "users")
assert_unique(orders, ["order_id"], "orders")
df = users.merge(orders, on="user_id", how="left")
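The remaining patterns from the list above (groupby aggregates, reshape, time series, conditional logic) can be sketched on a synthetic daily table. The table and column names are invented for the example:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D").repeat(2),
    "platform": ["ios", "android"] * 60,
    "revenue": rng.uniform(100, 200, 120),
})

# Groupby aggregates: multi-metric summary per platform
summary = daily.groupby("platform").agg(total=("revenue", "sum"), avg=("revenue", "mean"))

# Reshape: long -> wide and back to long
wide = daily.pivot_table(index="date", columns="platform", values="revenue")
long_form = wide.reset_index().melt(id_vars="date", var_name="platform", value_name="revenue")

# Time series: weekly resample per platform
weekly = daily.set_index("date").groupby("platform")["revenue"].resample("W").sum()

# Conditional logic: flag high-revenue rows
daily["rev_band"] = np.where(daily["revenue"] > 150, "high", "normal")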
Data Cleaning Checklist (Fast + Practical)
- Schema: df.info(), dtypes, parse dates at read time.
- Missingness: quantify per column; decide drop, impute, or model as signal.
- Duplicates: define a primary key; dedupe with a rule (latest timestamp wins).
- Outliers: define a rule (winsorize, cap, remove) and justify it.
- Units: verify currency, percent vs fraction, and timezones.
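Two of these rules in code, as a hedged sketch on a toy table (the primary key, timestamp, and metric columns are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"]),
    "metric": [10.0, 12.0, 500.0, 11.0],
})

# Duplicates: define the primary key and keep the latest record (latest timestamp wins)
df = df.sort_values("updated_at").drop_duplicates("user_id", keep="last")

# Outliers: cap the metric at its 99th percentile instead of dropping rows
cap = df["metric"].quantile(0.99)
df["metric_capped"] = df["metric"].clip(upper=cap)
print(df)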
Example: EDA Template You Can Reuse
import pandas as pd
import numpy as np
def quick_eda(df: pd.DataFrame, target: str | None = None) -> None:
    # display() assumes a Jupyter/IPython environment
    print("shape:", df.shape)
    display(df.head(3))
    display(df.describe(include="all").T)
    # missing-value rate per column, highest first
    missing = df.isna().mean().sort_values(ascending=False)
    display(missing[missing > 0].to_frame("missing_rate"))
if target and target in df.columns:
display(df[target].value_counts(dropna=False).head(20))
quick_eda(df, target=None)
Example: Cohort Retention in Pandas (Interview Favorite)
Input: events with user_id and event_date. Output: retention table by cohort month.
import pandas as pd
events = events.copy()
events["event_date"] = pd.to_datetime(events["event_date"], errors="coerce")
events = events.dropna(subset=["event_date"])
events["event_month"] = events["event_date"].dt.to_period("M")
# Cohort = month of each user's first event
cohort = events.groupby("user_id")["event_month"].min().rename("cohort")
events = events.join(cohort, on="user_id")
# Period = whole months elapsed since the cohort month
events["period"] = (events["event_month"] - events["cohort"]).apply(lambda p: p.n)
# One row per user per period, then count distinct users per cohort/period
active = events.drop_duplicates(["user_id", "period"])
ret = active.pivot_table(index="cohort", columns="period", values="user_id", aggfunc="nunique")
# Normalize each cohort by its period-0 size to get retention rates
ret_rate = ret.div(ret[0], axis=0)
ret_rate
Example: A/B Test Metric + Bootstrap CI
Input: user-level table with group and numeric metric.
import numpy as np
# Split the user-level metric by experiment arm
control = df.loc[df["group"] == "control", "metric"].astype(float)
treat = df.loc[df["group"] == "treatment", "metric"].astype(float)
# Observed difference in means (average treatment effect)
ate = treat.mean() - control.mean()
# Bootstrap the difference in means; fixed seed for reproducibility
rng = np.random.default_rng(42)
boot = []
for _ in range(3000):
    boot.append(rng.choice(treat, len(treat), replace=True).mean() - rng.choice(control, len(control), replace=True).mean())
# 95% percentile confidence interval
ci = np.percentile(boot, [2.5, 97.5])
print({"ate": ate, "ci_low": ci[0], "ci_high": ci[1]})
Tip: In interviews, say whether your CI is for the mean difference and what assumptions you are making (IID users, stable assignment, etc.).
Common Pitfalls (And How to Avoid Them)
- Chained assignment: prefer df = df.assign(...) or df.loc[mask, col] = ....
- SettingWithCopy warnings: use .copy() when subsetting.
- Silent type issues: parse numerics with pd.to_numeric(..., errors='coerce').
- Timezone bugs: standardize to UTC before grouping by day.
- Speed: avoid row-wise apply when vectorization exists.
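A few of these fixes in code, as a sketch on a toy table (the columns are invented for the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, -1.0, 40.0],
    "year": [2023, 2024, 2024],
    "amount": ["10.5", "oops", "12"],
    "ts": ["2024-01-01 12:00:00+02:00", "2024-01-01 23:30:00-05:00", "2024-01-02 08:00:00+00:00"],
})

# Avoid chained assignment: write through .loc in a single step
df.loc[df["age"] < 0, "age"] = np.nan

# Take an explicit copy when you keep working on a subset
recent = df[df["year"] == 2024].copy()

# Coerce messy strings to numbers instead of silently keeping object dtype
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Standardize timestamps to UTC before grouping by day
df["ts"] = pd.to_datetime(df["ts"], utc=True)
daily_counts = df.groupby(df["ts"].dt.date).size()
print(df.dtypes)
print(daily_counts)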
Example: Creating a DataFrame
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 28],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)
Example: NumPy Array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(f"Mean: {np.mean(arr)}")
print(f"Std: {np.std(arr)}")
Practice Drills (Try Without Looking)
- Groupby: compute DAU by day and platform from an events table.
- Join: left-join orders to users and compute conversion rate by signup week.
- Time series: compute a 7-day rolling average of revenue (one possible sketch appears after this list).
- Outliers: cap metric at 99th percentile per country, then recompute mean.
- Debugging: fix a merge that unexpectedly triples row count.
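Once you have attempted the drills, here is one possible sketch for the time-series drill; the revenue table is synthetic:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
revenue = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "revenue": rng.uniform(1000, 2000, 30),
})

# 7-day rolling average; min_periods=1 keeps the first week defined
revenue["revenue_7d_avg"] = (
    revenue.sort_values("date")["revenue"].rolling(window=7, min_periods=1).mean()
)
print(revenue.tail())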
Test Your Knowledge
🧠 Python Data Analysis Quiz
Test your knowledge of Pandas, NumPy, and Python data manipulation techniques.
1. Which Pandas method is used to remove duplicate rows from a DataFrame?
2. What is the primary difference between a Pandas Series and a DataFrame?
3. Which NumPy function calculates the standard deviation of an array?
4. How do you select rows in a DataFrame where the 'Age' column is greater than 25?
5. What does the groupby() method in Pandas do?