Technical Skills Interview (Coding/SQL)

SQL deep dive, Python/R data manipulation, and hands-on practice.

What to Expect

This interview assesses your coding and problem-solving abilities using data. Expect SQL-heavy questions, but be prepared to use your preferred language (Python/R) for data manipulation and analysis tasks.

Key Areas

  • SQL Proficiency: Writing efficient and complex queries involving joins, aggregations, window functions, subqueries, and CTEs (a sketch follows this list). Be prepared to optimize queries for performance.
  • Data Manipulation (Python/R): Using pandas and NumPy (Python) or dplyr and base R for data cleaning, transformation, and analysis.
  • Algorithm Implementation (Less Common): In some cases, you might be asked to implement basic algorithms or data structures.
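
To give the SQL bullet concrete shape, here is a minimal sketch combining a CTE with a window function; the orders table and its columns are hypothetical:

-- Hypothetical table: orders (customer_id, order_date, amount)
-- Running revenue total per customer (CTE + window function)
WITH recent_orders AS (
  SELECT customer_id, order_date, amount
  FROM orders
  WHERE order_date >= DATE '2025-01-01'
)
SELECT
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS running_total
FROM recent_orders
ORDER BY customer_id, order_date;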

How to Prepare

  • Understand the Problem Thoroughly: Before jumping into coding, ensure you fully understand the problem. Ask clarifying questions to the interviewer to confirm your understanding of the requirements, edge cases, and any constraints.
  • Communicate Your Thought Process: While practicing, solve problems by "thinking out loud." Verbalize your thought process, explain your chosen approach, and justify your decisions.
  • Plan Your Approach: Take a moment to plan how you want to solve the problem before you start coding. Break down the problem into smaller, manageable subproblems.
  • Explain Trade-offs: Be prepared to discuss the trade-offs of different approaches.
  • Practice SQL Extensively: Use platforms like SQLZOO, HackerRank SQL, LeetCode Database, and StrataScratch.
  • Master Data Manipulation Libraries: Become very comfortable with Pandas (Python) or dplyr (R).
  • Focus on Problem-Solving: Practice breaking down complex problems into smaller, manageable parts.
  • Write Clean and Efficient Code: Pay attention to code readability, style, and efficiency.
  • Mock Interviews: Practice coding interviews with friends or using platforms like Pramp or InterviewBit.

Common Analytical Patterns

While every problem is unique, many data analysis questions fall into common patterns:

  1. Filtering and Aggregation: Filtering data based on certain criteria and then aggregating it using functions like COUNT, SUM, AVG, MIN, MAX.
  2. Joining and Combining Data: Combining data from multiple tables using JOINs.
  3. Ranking and Ordering: Ranking data based on certain criteria or ordering it in a specific way.
  4. Window Functions: Performing calculations across a set of rows that are related to the current row.
  5. Time Series Analysis: Analyzing data over time to identify trends, patterns, and anomalies. (A sketch combining several of these patterns follows this list.)
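
As an illustration, here is a hedged sketch that combines filtering, aggregation, a window function, and a time-series view in one query (Postgres syntax, matching the solutions later in this guide; the sales table is hypothetical):

-- Hypothetical table: sales (sale_date, amount)
-- Monthly revenue and month-over-month change (patterns 1, 4, and 5)
WITH monthly AS (
  SELECT
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS revenue
  FROM sales
  WHERE sale_date >= DATE '2025-01-01'
  GROUP BY DATE_TRUNC('month', sale_date)
)
SELECT
  month,
  revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM monthly
ORDER BY month;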

Example Question (SQL)

Given a table UserActivity (user_id, activity_date, activity_type), write a query to find the number of users who performed each activity type on each date.

SELECT activity_date, activity_type, COUNT(DISTINCT user_id) AS num_users
FROM UserActivity
GROUP BY activity_date, activity_type;

COUNT(DISTINCT user_id) matters here: a user who performs the same activity type several times on one date should still be counted once.

Example Question (Python/Pandas)

Calculate the average age by city using Pandas:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 28, 22, 27],
        'City': ['New York', 'London', 'New York', 'London', 'Paris']}
df = pd.DataFrame(data)

average_age_by_city = df.groupby('City')['Age'].mean()
print(average_age_by_city)
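
If you want a DataFrame instead of a Series (for example, to merge the result back into df), pass as_index=False:

average_age_by_city = df.groupby('City', as_index=False)['Age'].mean()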

🧠 Interview-Grade SQL Challenges

Challenge 1: 7-Day Rolling Active Users

Difficulty: 🟡 Medium

Table: daily_logins (user_id, login_date)

Question: Calculate the 7-day rolling count of unique active users for each day.

💡 Hint

Use a self-join with a date range, or aggregate per day and use RANGE BETWEEN in a window function. (Most engines, including Postgres, do not allow COUNT(DISTINCT ...) inside a window frame; see the note after the solution.)

✅ Solution
-- Approach 1: Self-join
WITH daily_users AS (
  SELECT DISTINCT login_date, user_id
  FROM daily_logins
)
SELECT 
  d1.login_date,
  COUNT(DISTINCT d2.user_id) AS rolling_7d_users
FROM (SELECT DISTINCT login_date FROM daily_logins) d1
JOIN daily_users d2 
  ON d2.login_date BETWEEN d1.login_date - INTERVAL '6 days' AND d1.login_date
GROUP BY d1.login_date
ORDER BY d1.login_date;

-- Approach 2: Window function with date spine
WITH date_spine AS (
  SELECT generate_series(
    (SELECT MIN(login_date) FROM daily_logins),
    (SELECT MAX(login_date) FROM daily_logins),
    '1 day'::interval
  )::date AS dt
),
user_presence AS (
  SELECT 
    ds.dt,
    dl.user_id
  FROM date_spine ds
  LEFT JOIN daily_logins dl 
    ON dl.login_date BETWEEN ds.dt - INTERVAL '6 days' AND ds.dt
)
SELECT 
  dt AS login_date,
  COUNT(DISTINCT user_id) AS rolling_7d_users
FROM user_presence
GROUP BY dt
ORDER BY dt;

Key Insight:

The self-join approach counts unique users across a trailing 7-day window; - INTERVAL '6 days' gives 7 days in total (today plus the 6 previous days). Approach 1 emits only dates that actually appear in daily_logins, while Approach 2's date spine also covers gap days with no logins: there the LEFT JOIN leaves user_id NULL, and COUNT(DISTINCT user_id) correctly returns 0.
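
As the hint notes, RANGE BETWEEN is the idiomatic window-frame tool, but because COUNT(DISTINCT ...) is not allowed inside a window frame, it works directly only for non-distinct counts. A sketch for a rolling count of total logins (rows, not unique users), assuming Postgres 11+:

-- Rolling 7-day count of logins (not distinct users)
WITH per_day AS (
  SELECT login_date, COUNT(*) AS logins
  FROM daily_logins
  GROUP BY login_date
)
SELECT
  login_date,
  SUM(logins) OVER (
    ORDER BY login_date
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS rolling_7d_logins
FROM per_day;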

Challenge 2: First Purchase by Category

Difficulty: 🟡 Medium

Table: purchases (user_id, category, purchase_date, amount)

Question: For each user, find their first purchase in each category.

✅ Solution
-- Using DISTINCT ON (Postgres-specific, fastest)
SELECT DISTINCT ON (user_id, category)
  user_id,
  category,
  purchase_date AS first_purchase_date,
  amount AS first_purchase_amount
FROM purchases
ORDER BY user_id, category, purchase_date ASC;

-- Using ROW_NUMBER (universal)
WITH ranked AS (
  SELECT 
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, category 
      ORDER BY purchase_date ASC
    ) AS rn
  FROM purchases
)
SELECT user_id, category, purchase_date, amount
FROM ranked
WHERE rn = 1;
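
One edge case worth raising with the interviewer: if a user has two purchases in the same category at the same purchase_date, both queries above pick one arbitrarily. A deterministic tiebreaker makes the result reproducible; the purchase_id column here is hypothetical:

-- Assuming a hypothetical purchase_id column exists
SELECT DISTINCT ON (user_id, category)
  user_id, category, purchase_date, amount
FROM purchases
ORDER BY user_id, category, purchase_date ASC, purchase_id ASC;
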
Challenge 3: Identify Power Users

Difficulty: 🔴 Hard

Tables:

  • sessions (user_id, session_start, session_end)
  • purchases (user_id, purchase_date, amount)

Question: Find users who:

  1. Had 10+ sessions in the last 30 days
  2. AND made 3+ purchases in the last 30 days
  3. AND have total purchase amount > $500
✅ Solution
WITH session_counts AS (
  SELECT 
    user_id,
    COUNT(*) AS sessions_30d
  FROM sessions
  WHERE session_start >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
  HAVING COUNT(*) >= 10
),
purchase_stats AS (
  SELECT 
    user_id,
    COUNT(*) AS purchases_30d,
    SUM(amount) AS total_amount
  FROM purchases
  WHERE purchase_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
  HAVING COUNT(*) >= 3 AND SUM(amount) > 500
)
SELECT 
  s.user_id,
  s.sessions_30d,
  p.purchases_30d,
  p.total_amount
FROM session_counts s
JOIN purchase_stats p USING (user_id);

Pattern:

Use CTEs to calculate each criterion separately, then JOIN to find users meeting ALL criteria.
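
If the interviewer lets you pick the tool, the same pattern translates directly to pandas. A minimal sketch, assuming sessions and purchases are DataFrames with the columns above, already filtered to the last 30 days:

# Assumes sessions and purchases are DataFrames filtered to the last 30 days
# One aggregate per criterion, mirroring the CTEs
session_counts = sessions.groupby('user_id', as_index=False).size()
session_counts = session_counts.rename(columns={'size': 'sessions_30d'})
session_counts = session_counts[session_counts['sessions_30d'] >= 10]

purchase_stats = purchases.groupby('user_id', as_index=False).agg(
    purchases_30d=('purchase_date', 'count'),
    total_amount=('amount', 'sum'),
)
purchase_stats = purchase_stats[
    (purchase_stats['purchases_30d'] >= 3) & (purchase_stats['total_amount'] > 500)
]

# Inner merge keeps only users meeting ALL criteria
power_users = session_counts.merge(purchase_stats, on='user_id')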

🐍 Interview-Grade Python Challenges

Challenge 1: Clean and Dedupe Customer Data

Difficulty: 🟡 Medium

Data:

customers = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 4],
    'name': ['Alice', 'Bob', 'Robert', 'Carol', 'Dave', 'Dave'],
    'email': ['alice@a.com', 'bob@b.com', 'bob@b.com', None, 'dave@d.com', 'dave@d.com'],
    'signup_date': ['2025-01-01', '2025-01-05', '2025-01-06', '2025-01-10', '2025-01-15', '2025-01-20'],
    'is_verified': [True, False, True, False, True, False]
})

Task:

  1. Dedupe by customer ID, keeping the row with is_verified=True (or most recent signup if no verified)
  2. Fill missing emails with 'unknown@company.com'
  3. Ensure signup_date is datetime type
✅ Solution
import pandas as pd

# Start with the data
customers = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 4],
    'name': ['Alice', 'Bob', 'Robert', 'Carol', 'Dave', 'Dave'],
    'email': ['alice@a.com', 'bob@b.com', 'bob@b.com', None, 'dave@d.com', 'dave@d.com'],
    'signup_date': ['2025-01-01', '2025-01-05', '2025-01-06', '2025-01-10', '2025-01-15', '2025-01-20'],
    'is_verified': [True, False, True, False, True, False]
})

# Convert signup_date to datetime
customers['signup_date'] = pd.to_datetime(customers['signup_date'])

# Sort by id, is_verified (True first), then signup_date (most recent first)
customers_sorted = customers.sort_values(
    by=['id', 'is_verified', 'signup_date'],
    ascending=[True, False, False]
)

# Keep the first occurrence per id (verified first, then most recent);
# .copy() avoids a SettingWithCopyWarning when we assign a column below
customers_deduped = customers_sorted.drop_duplicates(subset='id', keep='first').copy()

# Fill missing emails
customers_deduped['email'] = customers_deduped['email'].fillna('unknown@company.com')

print(customers_deduped)

Output:

   id    name                email signup_date  is_verified
0   1   Alice          alice@a.com  2025-01-01         True
2   2  Robert            bob@b.com  2025-01-06         True
3   3   Carol  unknown@company.com  2025-01-10        False
4   4    Dave           dave@d.com  2025-01-15         True

Challenge 2: Calculate Cohort Retention in Pandas

Difficulty: 🔴 Hard

Data:

users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'signup_date': ['2025-01-01', '2025-01-15', '2025-02-01', '2025-02-10', '2025-03-01']
})

activity = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 5],
    'activity_date': ['2025-01-05', '2025-02-10', '2025-03-15', 
                      '2025-01-20', '2025-02-20', '2025-02-05', '2025-03-05',
                      '2025-02-15', '2025-03-10']
})

Task: Create a cohort retention table showing the % of users retained in each month after signup (month 0 = the signup month).

✅ Solution
import pandas as pd

# Prepare data
users['signup_date'] = pd.to_datetime(users['signup_date'])
activity['activity_date'] = pd.to_datetime(activity['activity_date'])

# Add cohort month to users
users['cohort_month'] = users['signup_date'].dt.to_period('M')

# Merge activity with users
merged = activity.merge(users[['user_id', 'signup_date', 'cohort_month']], on='user_id')

# Calculate months since signup
merged['months_since_signup'] = (
    (merged['activity_date'].dt.to_period('M') - merged['cohort_month'])
    .apply(lambda x: x.n if pd.notna(x) else None)
)

# Keep non-negative months (activity in or after the signup month)
merged = merged[merged['months_since_signup'] >= 0]

# Count unique users per cohort and month
retention = merged.groupby(['cohort_month', 'months_since_signup']).agg(
    active_users=('user_id', 'nunique')
).reset_index()

# Get cohort sizes
cohort_sizes = users.groupby('cohort_month').size().reset_index(name='cohort_size')

# Merge and calculate retention rate
retention = retention.merge(cohort_sizes, on='cohort_month')
retention['retention_rate'] = (retention['active_users'] / retention['cohort_size'] * 100).round(1)

# Pivot for display
retention_matrix = retention.pivot(
    index='cohort_month', 
    columns='months_since_signup', 
    values='retention_rate'
).fillna(0)

print(retention_matrix)

Output:

months_since_signup      0      1     2
cohort_month                           
2025-01              100.0  100.0  50.0
2025-02              100.0   50.0   0.0
2025-03              100.0    0.0   0.0

Note that fillna(0) also zero-fills months a cohort has not yet reached (months 1-2 for the 2025-03 cohort, for example); in a real analysis you would mask unobserved months rather than report them as churn.

Challenge 3: Implement a Simple A/B Test Analysis

Difficulty: 🔴 Hard

Data:

experiment = pd.DataFrame({
    'user_id': range(1, 2001),
    'variant': ['control']*1000 + ['treatment']*1000,
    'converted': [0]*920 + [1]*80 + [0]*890 + [1]*110  # 8% control, 11% treatment
})

Task: Calculate the conversion rate, lift, confidence interval, and p-value.

✅ Solution
import pandas as pd
import numpy as np
from scipy import stats

# Calculate stats per variant
summary = experiment.groupby('variant').agg(
    users=('user_id', 'count'),
    conversions=('converted', 'sum'),
    conversion_rate=('converted', 'mean')
).reset_index()

print("Summary:\n", summary)

# Extract values
control = summary[summary['variant'] == 'control'].iloc[0]
treatment = summary[summary['variant'] == 'treatment'].iloc[0]

p_control = control['conversion_rate']
p_treatment = treatment['conversion_rate']
n_control = control['users']
n_treatment = treatment['users']

# Calculate lift
lift = (p_treatment - p_control) / p_control * 100
print(f"\nLift: {lift:.1f}%")

# Pooled proportion and standard error
p_pooled = (control['conversions'] + treatment['conversions']) / (n_control + n_treatment)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))

# Z-statistic and p-value
z = (p_treatment - p_control) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # Two-tailed
print(f"Z-statistic: {z:.3f}")
print(f"P-value: {p_value:.4f}")

# 95% Confidence interval for the difference
diff = p_treatment - p_control
se_diff = np.sqrt(p_treatment*(1-p_treatment)/n_treatment + p_control*(1-p_control)/n_control)
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
print(f"95% CI for difference: [{ci_lower*100:.2f}%, {ci_upper*100:.2f}%]")

# Conclusion
if p_value < 0.05:
    print("\n✅ Result is statistically significant at α=0.05")
else:
    print("\n❌ Result is NOT statistically significant at α=0.05")

Output:

Summary:
     variant  users  conversions  conversion_rate
0   control   1000           80             0.08
1  treatment  1000          110             0.11

Lift: 37.5%
Z-statistic: 2.288
P-value: 0.0222
95% CI for difference: [0.43%, 5.57%]

✅ Result is statistically significant at α=0.05
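
In the interview it is also worth mentioning that a library can run the same z-test in one call. A sketch using statsmodels, if it is available in your environment:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Conversions and sample sizes per variant, ordered [treatment, control]
count = np.array([110, 80])
nobs = np.array([1000, 1000])

z_stat, p_val = proportions_ztest(count, nobs)  # two-sided by default
print(f"Z-statistic: {z_stat:.3f}, P-value: {p_val:.4f}")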

⚠️ Common Technical Interview Mistakes

| Mistake | Why It Hurts | Fix |
|---|---|---|
| Jumping straight to code | May solve the wrong problem | Ask 2-3 clarifying questions first |
| Not testing edge cases | NULLs and empty tables break queries | Always consider empty input, NULL values, and duplicates |
| Overly complex solution | Hard to debug and explain | Start simple, optimize if needed |
| Silent coding | Interviewer can't assess your thinking | Narrate your thought process |
| Not validating output | Logic errors slip through | Trace through with sample data |

✅ Technical Interview Checklist

Before your interview:

  • โ˜ Practice 10+ SQL problems covering window functions, CTEs, JOINs
  • โ˜ Solve 5+ Pandas problems (groupby, merge, pivot)
  • โ˜ Know how to calculate sample size for an A/B test
  • โ˜ Can implement a two-sample proportion test in Python
  • โ˜ Practice explaining your code out loud
  • โ˜ Have your IDE/editor ready with sample data