Programming (Python/R) for Data Analysis
Python fundamentals, Pandas, NumPy, visualization, and optional modeling libraries.
Overview
While SQL is often the primary focus in data science interviews, demonstrating proficiency in Python or R is crucial for data manipulation, analysis, and visualization. You may be asked to write code snippets, explain code logic, or discuss how you would approach a data-related problem using these languages. Expect questions involving:
- Core Libraries: Pandas, NumPy, data visualization libraries (Matplotlib, Seaborn)
- Potentially: Statistical modeling libraries (Statsmodels, scikit-learn), and other specialized libraries depending on the role (e.g., NLP libraries like NLTK or spaCy).
Interview-Grade Python: What "Good" Looks Like
- Correctness first: handle missing values, duplicate keys, timezone/date quirks, and edge cases.
- Readable & modular: small functions, clear variable names, explicit assumptions.
- Vectorized when it matters: prefer pandas operations over Python loops for large data (a quick comparison follows this list).
- Reproducible: fixed seeds, deterministic sorts, and saved intermediate outputs when useful.
- Communicates insight: code answers a question; narrative explains why the result matters.
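To make the vectorization point concrete, here is a minimal sketch comparing a row-wise apply with the equivalent vectorized operation. The DataFrame and column name are invented for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(0).uniform(1, 100, 100_000)})

# Slow: row-wise Python function applied to every value
slow = df["price"].apply(lambda x: x * 1.2)

# Fast: one vectorized operation on the whole column
fast = df["price"] * 1.2

assert np.allclose(slow, fast)

Both produce the same result; the vectorized version pushes the loop into compiled code, which is what interviewers usually want to see for large tables.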
How to Prep
- Practice Regularly: Consistent practice is key. Work through coding exercises on platforms like HackerRank, LeetCode, and StrataScratch, focusing on data manipulation and analysis problems.
- Focus on Fundamentals: Ensure you have a solid understanding of data structures (lists, dictionaries, arrays), control flow (loops, conditional statements), and functions.
- Master Core Libraries: Become proficient in using Pandas for data manipulation, NumPy for numerical operations, and Matplotlib/Seaborn for visualization.
- Understand Data Cleaning and Transformation: Practice techniques for handling missing values, data type conversions, and data aggregation.
- Be Comfortable Explaining Your Code: Be prepared to walk through your code line by line, explaining the logic and reasoning behind your choices. Consider time and space complexity of your solutions.
- Think About Edge Cases: When designing solutions, consider potential edge cases and how your code handles them.
Resources
Key Libraries and Functionalities
Pandas
- DataFrames: Two-dimensional labeled data structures with columns of potentially different types.
- Series: One-dimensional labeled array.
- Data Cleaning: Handling missing values (e.g., df.fillna(), df.dropna()), removing duplicates (df.drop_duplicates()).
- Data Transformation: Filtering (df[df['Age'] > 25]), sorting (df.sort_values('Age')), adding/removing columns.
- Data Aggregation: Grouping data and applying aggregate functions (e.g., df.groupby('City')['Age'].mean()).
- Reading and Writing Data: Reading data from CSV, Excel, and other formats (pd.read_csv(), pd.read_excel()).
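A minimal sketch that combines these operations on a toy table (the column names are illustrative, not from a real dataset):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Diana"],
    "Age": [25, 30, 30, None],
    "City": ["New York", "London", "London", "Paris"],
})

df = df.drop_duplicates()                       # cleaning: remove duplicate rows
df["Age"] = df["Age"].fillna(df["Age"].mean())  # cleaning: impute missing age with the mean
adults = df[df["Age"] > 25].sort_values("Age")  # transformation: filter and sort
mean_age = df.groupby("City")["Age"].mean()     # aggregation: mean age per city
print(adults)
print(mean_age)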
NumPy
- Arrays: N-dimensional arrays for efficient numerical operations.
- Mathematical Functions: Performing mathematical operations on arrays (e.g., np.mean(), np.std(), np.sum()).
- Linear Algebra: Matrix operations, dot products, etc.
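For the linear algebra bullet, a small sketch of the operations worth having at your fingertips (the matrices are made up for illustration):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

print(A @ b)             # matrix-vector product
print(A.T)               # transpose
print(np.dot(b, b))      # dot product
print(np.linalg.inv(A))  # inverse (A is invertible here)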
Matplotlib/Seaborn
- Matplotlib: Creating basic plots like line plots, scatter plots, bar charts, histograms.
- Seaborn: Building on top of Matplotlib to create more visually appealing and informative statistical graphics.
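A short plotting sketch showing the two libraries side by side, assuming Matplotlib and Seaborn are installed; the data is synthetic:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(values, bins=30)               # plain Matplotlib histogram
axes[0].set_title("Matplotlib hist")
sns.histplot(values, kde=True, ax=axes[1])  # Seaborn adds a KDE overlay and nicer defaults
axes[1].set_title("Seaborn histplot")
plt.tight_layout()
plt.show()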
Core Pandas Patterns (Most Frequently Tested)
- Join/merge: pd.merge(left, right, on=..., how='left') + verify key uniqueness.
- Groupby aggregates: groupby(...).agg(...) for multi-metric tables.
- Reshape: pivot_table, melt, wide-to-long and long-to-wide.
- Time series: .dt accessors, resampling, rolling windows.
- Conditional logic: np.where, boolean masks, and DataFrame.assign.
- Safe counts: nunique, value_counts(dropna=False), and explicit denominators.
Key-check pattern (prevents silent join explosions):
# Fail fast if the merge key is not unique; prevents silent row multiplication
def assert_unique(df, cols, name="df"):
dupes = df.duplicated(cols).sum()
if dupes:
raise ValueError(f"{name}: {dupes} duplicate keys on {cols}")
assert_unique(users, ["user_id"], "users")
assert_unique(orders, ["order_id"], "orders")
df = users.merge(orders, on="user_id", how="left")
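The remaining patterns from the list above (groupby aggregates, reshape, time series, conditional logic) can be sketched on a synthetic daily table. The table and column names are invented for the example:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D").repeat(2),
    "platform": ["ios", "android"] * 60,
    "revenue": rng.uniform(100, 200, 120),
})

# Groupby aggregates: multi-metric summary per platform
summary = daily.groupby("platform").agg(total=("revenue", "sum"), avg=("revenue", "mean"))

# Reshape: long -> wide and back to long
wide = daily.pivot_table(index="date", columns="platform", values="revenue")
long_form = wide.reset_index().melt(id_vars="date", var_name="platform", value_name="revenue")

# Time series: weekly resample per platform
weekly = daily.set_index("date").groupby("platform")["revenue"].resample("W").sum()

# Conditional logic: flag high-revenue rows
daily["rev_band"] = np.where(daily["revenue"] > 150, "high", "normal")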
Data Cleaning Checklist (Fast + Practical)
- Schema: df.info(), dtypes, parse dates at read time.
- Missingness: quantify per column; decide drop, impute, or model as signal.
- Duplicates: define a primary key; dedupe with a rule (latest timestamp wins).
- Outliers: define a rule (winsorize, cap, remove) and justify it.
- Units: verify currency, percent vs fraction, and timezones.
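Two of these rules in code, as a hedged sketch on a toy table (the primary key, timestamp, and metric columns are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"]),
    "metric": [10.0, 12.0, 500.0, 11.0],
})

# Duplicates: define the primary key and keep the latest record (latest timestamp wins)
df = df.sort_values("updated_at").drop_duplicates("user_id", keep="last")

# Outliers: cap the metric at its 99th percentile instead of dropping rows
cap = df["metric"].quantile(0.99)
df["metric_capped"] = df["metric"].clip(upper=cap)
print(df)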
Example: EDA Template You Can Reuse
import pandas as pd
import numpy as np
def quick_eda(df: pd.DataFrame, target: str | None = None) -> None:
    # display() assumes a Jupyter/IPython environment
    print("shape:", df.shape)
    display(df.head(3))
    display(df.describe(include="all").T)
    # missing-value rate per column, highest first
    missing = df.isna().mean().sort_values(ascending=False)
    display(missing[missing > 0].to_frame("missing_rate"))
if target and target in df.columns:
display(df[target].value_counts(dropna=False).head(20))
quick_eda(df, target=None)
Example: Cohort Retention in Pandas (Interview Favorite)
Input: events with user_id and event_date. Output: retention table by cohort month.
import pandas as pd
events = events.copy()
events["event_date"] = pd.to_datetime(events["event_date"], errors="coerce")
events = events.dropna(subset=["event_date"])
events["event_month"] = events["event_date"].dt.to_period("M")
# Cohort = month of each user's first event
cohort = events.groupby("user_id")["event_month"].min().rename("cohort")
events = events.join(cohort, on="user_id")
# Period = whole months elapsed since the cohort month
events["period"] = (events["event_month"] - events["cohort"]).apply(lambda p: p.n)
# One row per user per period, then count distinct users per cohort/period
active = events.drop_duplicates(["user_id", "period"])
ret = active.pivot_table(index="cohort", columns="period", values="user_id", aggfunc="nunique")
# Normalize each cohort by its period-0 size to get retention rates
ret_rate = ret.div(ret[0], axis=0)
ret_rate
Example: A/B Test Metric + Bootstrap CI
Input: user-level table with group and numeric metric.
import numpy as np
# Split the user-level metric by experiment arm
control = df.loc[df["group"] == "control", "metric"].astype(float)
treat = df.loc[df["group"] == "treatment", "metric"].astype(float)
# Observed difference in means (average treatment effect)
ate = treat.mean() - control.mean()
# Bootstrap the difference in means; fixed seed for reproducibility
rng = np.random.default_rng(42)
boot = []
for _ in range(3000):
    boot.append(rng.choice(treat, len(treat), replace=True).mean() - rng.choice(control, len(control), replace=True).mean())
# 95% percentile confidence interval
ci = np.percentile(boot, [2.5, 97.5])
print({"ate": ate, "ci_low": ci[0], "ci_high": ci[1]})
Tip: In interviews, say whether your CI is for the mean difference and what assumptions you are making (IID users, stable assignment, etc.).
Common Pitfalls (And How to Avoid Them)
- Chained assignment: prefer df = df.assign(...) or df.loc[mask, col] = ....
- SettingWithCopy warnings: use .copy() when subsetting.
- Silent type issues: parse numerics with pd.to_numeric(..., errors='coerce').
- Timezone bugs: standardize to UTC before grouping by day.
- Speed: avoid row-wise apply when vectorization exists.
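A few of these fixes in code, as a sketch on a toy table (the columns are invented for the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, -1.0, 40.0],
    "year": [2023, 2024, 2024],
    "amount": ["10.5", "oops", "12"],
    "ts": ["2024-01-01 12:00:00+02:00", "2024-01-01 23:30:00-05:00", "2024-01-02 08:00:00+00:00"],
})

# Avoid chained assignment: write through .loc in a single step
df.loc[df["age"] < 0, "age"] = np.nan

# Take an explicit copy when you keep working on a subset
recent = df[df["year"] == 2024].copy()

# Coerce messy strings to numbers instead of silently keeping object dtype
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Standardize timestamps to UTC before grouping by day
df["ts"] = pd.to_datetime(df["ts"], utc=True)
daily_counts = df.groupby(df["ts"].dt.date).size()
print(df.dtypes)
print(daily_counts)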
Example: Creating a DataFrame
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 28],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)
Example: NumPy Array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(f"Mean: {np.mean(arr)}")
print(f"Std: {np.std(arr)}")
Practice Drills (Try Without Looking)
- Groupby: compute DAU by day and platform from an events table.
- Join: left-join orders to users and compute conversion rate by signup week.
- Time series: compute a 7-day rolling average of revenue (one possible sketch appears after this list).
- Outliers: cap metric at 99th percentile per country, then recompute mean.
- Debugging: fix a merge that unexpectedly triples row count.
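Once you have attempted the drills, here is one possible sketch for the time-series drill; the revenue table is synthetic:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
revenue = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "revenue": rng.uniform(1000, 2000, 30),
})

# 7-day rolling average; min_periods=1 keeps the first week defined
revenue["revenue_7d_avg"] = (
    revenue.sort_values("date")["revenue"].rolling(window=7, min_periods=1).mean()
)
print(revenue.tail())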
Test Your Knowledge
🧠 Python Data Analysis Quiz
Test your knowledge of Pandas, NumPy, and Python data manipulation techniques.
1. Which Pandas method is used to remove duplicate rows from a DataFrame?
2. What is the primary difference between a Pandas Series and a DataFrame?
3. Which NumPy function calculates the standard deviation of an array?
4. How do you select rows in a DataFrame where the 'Age' column is greater than 25?
5. What does the groupby() method in Pandas do?