Product Analytics Agent Framework

How to build a complete, structured analytical response to any product question — from ambiguous prompt to actionable recommendation.

The Problem with Single-Pass Answers

Product analytics interview questions are intentionally vague: "Instagram engagement dropped 10% — investigate." Common failure modes are jumping to a conclusion ("It's probably a bug") or listing metrics without structure. Interviewers are evaluating how you think, not just what you know.

The solution is to treat your answer as an orchestrated workflow: break the problem into specialist sub-tasks, execute each with discipline, then synthesize into a recommendation. This mirrors how senior data scientists actually work.

The Four-Agent Mental Model

Think of your analytical response as four specialist "agents" operating in sequence. In a real AI system (or the MCP server's run_product_analytics_framework tool), the specialist steps in the middle run in parallel — in your interview answer, you execute all four sequentially, out loud.

| Agent | Question it answers | Output |
| --- | --- | --- |
| 1. Orchestrator | What type of problem is this? What framework applies? | Framework selection (HEART vs. AARRR), scope definition |
| 2. Metric Definer | What metrics matter? What do we protect? | Primary metric, counter-metrics, segments to analyze |
| 3. Experiment Designer | If we test a fix, how? What are the risks? | Randomization unit, duration, network-effect risks, guardrails |
| 4. Synthesis Agent | What do we recommend and how do we communicate it? | Investigation order, root-cause hypotheses, decision criteria |

Framework Selection: HEART vs AARRR vs North Star

When to Use HEART

HEART (Happiness, Engagement, Adoption, Retention, Task Success) was developed by Google UX Research. It is best for evaluating existing feature quality and user experience improvements.

  • Use when: diagnosing engagement drops, evaluating a redesign, improving retention
  • Strength: covers both quantitative metrics (retention, engagement) and qualitative signals (happiness, task success)
  • Meta context: core to product sense interviews — "How would you measure the success of Stories?"
| Dimension | What it measures | Example signals |
| --- | --- | --- |
| Happiness | User satisfaction and sentiment | NPS, CSAT, app store rating |
| Engagement | Frequency and depth of use | DAU, sessions/day, actions/session |
| Adoption | New users reaching core value | Feature adoption rate, time-to-first-use |
| Retention | Users returning over time | D7/D30 retention, churn rate, stickiness (DAU/MAU) (sketched below) |
| Task Success | Users completing intended goals | Completion rate, error rate, funnel conversion |
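
Two of these signals are easy to hand-wave in an interview, so it helps to be able to compute them. A minimal sketch of stickiness (DAU/MAU) and strict D7 retention, assuming a hypothetical in-memory activity log (a real pipeline would read event tables):

# Python sketch: stickiness and D7 retention from a toy activity log (hypothetical data).
from datetime import date, timedelta

# user -> set of active dates (assumed sample, not real numbers)
activity = {
    "u1": {date(2024, 1, 1), date(2024, 1, 8)},
    "u2": {date(2024, 1, 1)},
    "u3": {date(2024, 1, 2), date(2024, 1, 9)},
}

def dau(day):
    return sum(day in days for days in activity.values())

def mau(day):
    window = {day - timedelta(days=d) for d in range(30)}  # trailing 30-day window
    return sum(bool(days & window) for days in activity.values())

def d7_retained(cohort_day):
    # strict D7: active exactly 7 days after first activity
    cohort = [u for u, days in activity.items() if min(days) == cohort_day]
    back = [u for u in cohort if cohort_day + timedelta(days=7) in activity[u]]
    return len(back) / len(cohort) if cohort else 0.0

day = date(2024, 1, 8)
print(f"stickiness {dau(day) / mau(day):.2f}")               # DAU/MAU on Jan 8
print(f"D7 retention {d7_retained(date(2024, 1, 1)):.0%}")   # Jan 1 cohort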

When to Use AARRR

AARRR (Acquisition, Activation, Retention, Referral, Revenue) is the growth framework. Use it for evaluating growth levers, new market expansion, or monetization decisions.

  • Use when: evaluating a new market launch, improving onboarding, increasing virality, monetization analysis
  • Strength: maps the full user lifecycle from discovery to revenue
  • Meta context: common in "How would you grow WhatsApp in India?" type questions
| Stage | Key Question | Core Metrics |
| --- | --- | --- |
| Acquisition | How do users find us? | Installs, signups, CPA by channel |
| Activation | Do users experience core value? | Onboarding completion rate, time-to-aha-moment |
| Retention | Do users come back? | D7/D30 retention, churn rate, resurrection rate |
| Referral | Do users bring others? | Viral coefficient (K-factor, sketched below), invite acceptance rate |
| Revenue | Do we monetize effectively? | ARPU, LTV, conversion to paid, ARPDAU |
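
The one formula here worth knowing cold is the viral coefficient: K = invites sent per user × invite acceptance rate. A quick arithmetic sketch (both inputs are hypothetical numbers, not benchmarks):

# Python sketch: K-factor arithmetic with hypothetical inputs.
invites_per_user = 2.5     # avg invites each new user sends (assumed)
acceptance_rate = 0.12     # fraction of invites that convert (assumed)
k = invites_per_user * acceptance_rate          # K = 0.30

# For K < 1, a seed cohort's total downstream size follows a geometric series:
seed = 1000
total = seed / (1 - k)     # 1000 * (1 + K + K^2 + ...) ≈ 1429 users
print(f"K = {k:.2f}, {seed} seeds grow to ≈ {total:.0f} users")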

North Star Metric

Every product has one metric that best captures its core value delivery. For complex interviews, anchor your entire answer to the relevant North Star — then explain how your proposed actions affect it.

| Product | North Star | Why |
| --- | --- | --- |
| Facebook Feed | DAU/MAU (stickiness) | Captures daily habit formation at scale |
| Instagram Reels | Watch-through rate × share rate | Both consumption quality and viral distribution |
| WhatsApp | Messages sent per DAU | Depth of engagement, not just presence |
| Marketplace | Successful transactions per MAU | End-to-end value delivery (listing → sale) |

Worked Example: End-to-End Answer

Prompt: "Instagram Stories engagement is down 10% week-over-week. Investigate."

Agent 1 — Orchestrator: Frame the problem

"Before diving in, I want to clarify a few things: Is this a rate drop (engagement per story view) or an absolute drop? Is it global or region/platform specific? And what's the measurement window? I'll assume engagement rate = reactions + replies per story impression, global, last 7 days vs. prior 7 days."

"This is a diagnostics problem. I'll use the HEART framework with a focus on the Engagement and Task Success dimensions."

Agent 2 — Metric Definer: Define what to measure

  • Primary metric: Story engagement rate = (reactions + replies) / story impressions (computed in the sketch after this list)
  • Counter-metrics (guardrails): Story creation rate (did creators stop posting?), story view rate (did reach change?), spam rate (quality signal)
  • Segments to break down: Platform (iOS/Android/Web), country, user cohort (new vs. established), content type (text vs. photo vs. video)
  • Leading indicators to check: Story creation rate (if creators dropped, impressions follow), notification delivery rate (push notifications drive story views)
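
As a concreteness check, here is a minimal pandas sketch of the primary metric broken down by platform, using hypothetical weekly totals:

# Python sketch: engagement rate by segment from hypothetical weekly totals.
import pandas as pd

weekly = pd.DataFrame({
    "platform":    ["ios", "android", "web"],
    "impressions": [1_400_000, 1_000_000, 100_000],
    "reactions":   [84_000, 48_000, 3_000],
    "replies":     [21_000, 12_000, 700],
})

# primary metric per segment: (reactions + replies) / impressions
weekly["engagement_rate"] = (weekly["reactions"] + weekly["replies"]) / weekly["impressions"]

# overall rate: sum the components first, then divide (don't average per-segment rates)
overall = (weekly["reactions"].sum() + weekly["replies"].sum()) / weekly["impressions"].sum()
print(weekly[["platform", "engagement_rate"]])
print(f"overall: {overall:.4f}")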

Agent 3 — Experiment Designer: Plan a test if we identify a fix

"If our investigation identifies a fixable cause — say, the reaction tray is harder to reach in a new UI — we'd A/B test the fix. Key considerations:"

  • Randomization unit: User-level (not story-level), since story consumption is tied to individual user behavior patterns
  • Network effects: Stories are social — a change in one user's behavior (treatment) alters what their friends (possibly in control) see and do. Use cluster randomization by social-graph partition to minimize spillover
  • Duration: Minimum 2 weeks to capture the full weekly usage cycle and control for novelty effects
  • Decision criteria: Ship if engagement rate improves ≥ 2% (MDE) with p < 0.05, no regression in story creation rate or spam rate (sample-size arithmetic sketched after this list)
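
A quick way to sanity-check that decision criterion is a power calculation. A sketch using statsmodels, where the baseline rate and traffic figure are assumptions to be replaced with real numbers:

# Python sketch: sample size per arm for a 2% relative MDE on a rate metric.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                      # assumed baseline engagement rate
lift = baseline * 0.02               # 2% relative MDE
effect = proportion_effectsize(baseline + lift, baseline)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

daily_eligible_per_arm = 2_000_000   # assumed eligible traffic; tune to the product
days_for_power = n_per_arm / daily_eligible_per_arm
# hold the 2-week floor from the list above even if power is reached sooner
print(f"{n_per_arm:,.0f} users/arm ≈ {max(14, days_for_power):.0f} days")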

Agent 4 — Synthesis: Order of investigation and recommendation

  1. Check data pipeline first: Is the drop in the data or the product? Verify event ingestion lag and logging completeness
  2. Pinpoint timing: Plot daily engagement rate over last 30 days. If the drop is a step-change on a specific date → look for releases or infra changes
  3. Segment isolation: Run segment breakdown (platform × country). If 100% of the drop comes from Android users in Europe → likely a release bug (see the sketch after this list)
  4. Funnel diagnosis: Did story views drop (reach problem) or did reactions per view drop (interaction problem)? These have different fixes
  5. Root cause hypotheses (ordered by likelihood):
    • Product release changed the reaction UI → test on that release date
    • Algorithm change reduced story distribution → check organic reach rate
    • Seasonal behavior change → compare to same week prior year
    • Competitor launch pulling engagement elsewhere → cross-app usage data
  6. Communication: "We observed a 10% drop in story engagement rate starting [date], concentrated in [segment]. Our primary hypothesis is [cause] because [evidence]. We recommend [action] and will monitor story creation rate as a guardrail."
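
Step 3 is mechanical enough to script. A sketch of the segment-isolation arithmetic with hypothetical weekly totals (impressions held flat week-over-week so the entire rate drop comes from engagements):

# Python sketch: which segment accounts for the engagement-rate drop?
segments = {
    # segment: impressions (flat WoW), engagements prev week, engagements this week
    "android-EU": {"impr": 1_000_000, "prev": 60_000, "curr": 48_000},
    "android-US": {"impr": 1_200_000, "prev": 72_000, "curr": 71_500},
    "ios-EU":     {"impr":   900_000, "prev": 54_000, "curr": 53_800},
    "ios-US":     {"impr": 1_400_000, "prev": 84_000, "curr": 83_900},
}

total_impr = sum(s["impr"] for s in segments.values())
rate_drop = sum(s["prev"] - s["curr"] for s in segments.values()) / total_impr

for name, s in segments.items():
    # each segment's share of the overall rate drop (impression-weighted)
    share = ((s["prev"] - s["curr"]) / total_impr) / rate_drop
    print(f"{name}: {share:.0%} of the drop")

If one cell carries nearly all of the drop, as android-EU does in these made-up numbers, the release-bug hypothesis moves to the top of the list.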

AI Agent Orchestration in the MCP Server

How This Maps to the run_product_analytics_framework Tool

The MCP server implements this four-agent pattern as a single orchestrated tool. When you call run_product_analytics_framework, it fans out to three specialist sub-components (metric definition, experiment design, diagnostic SQL) and then synthesizes the outputs.

Example MCP tool call (via Claude or VS Code):
{
  "name": "run_product_analytics_framework",
  "arguments": {
    "question": "Instagram Stories engagement is down 10% week-over-week",
    "product_area": "engagement",
    "framework": "HEART",
    "include_sql": true
  }
}

The tool returns:

  • Metric framework — HEART dimensions relevant to engagement, with primary signals and guardrails
  • Experiment design — Randomization unit, network-effect risks, decision criteria
  • Diagnostic SQL templates — Time-series, segment breakdown, funnel, cohort comparison queries
  • Synthesis — Ordered investigation steps and communication template

Additional Specialist Tools

| Tool | When to Use |
| --- | --- |
| define_product_metrics | Only need metric definition (e.g., preparing for a metrics-focused interview round) |
| design_product_experiment | Only need experiment design (e.g., evaluating a specific A/B test plan) |
| generate_diagnostic_sql | Only need SQL templates (e.g., practicing diagnostic queries) |
| design_ab_experiment | Statistical experiment design with sample-size calculation (from the A/B testing tools) |

Why Orchestrator Pattern (Not Pipeline)

Metric definition, experiment design, and SQL generation are independent tasks — they don't depend on each other's outputs. The orchestrator pattern runs them in parallel and passes all results to the synthesis step, which depends on all three. This is faster and more modular than a linear pipeline where each step must wait for the previous one (a minimal sketch follows the diagram below).

User Question
      │
      ▼
 Orchestrator Agent   ← classifies problem, selects framework
      │ fans out (parallel)
 ┌────┴────────────────────────┐
 ▼              ▼              ▼
Metric       Experiment    SQL Query
Definer      Designer      Generator
 └────┬────────────────────────┘
      │ aggregates
      ▼
 Synthesis Agent   ← ordered investigation + communication plan
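
In code, this is a plain fan-out/gather. A minimal asyncio sketch; the specialist functions are stubs with hypothetical signatures, not the MCP server's actual implementation:

# Python sketch: orchestrator fan-out with asyncio (stub agents, hypothetical outputs).
import asyncio

async def define_metrics(question):
    return {"primary": "engagement_rate", "guardrails": ["creation_rate", "spam_rate"]}

async def design_experiment(question):
    return {"unit": "user", "duration_days": 14}

async def generate_sql(question):
    return ["-- time-series query", "-- segment breakdown query"]

async def run_framework(question):
    # fan out: the three specialists are independent, so run them concurrently
    metrics, experiment, sql = await asyncio.gather(
        define_metrics(question), design_experiment(question), generate_sql(question)
    )
    # synthesis is the only step with an ordering constraint: it needs all three outputs
    return {"metrics": metrics, "experiment": experiment, "sql": sql,
            "synthesis": "ordered investigation + communication plan"}

print(asyncio.run(run_framework("Stories engagement down 10% WoW")))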

Framework Quick Reference

Which Framework for Which Question?

| Question Type | Framework | Key Dimensions to Emphasize |
| --- | --- | --- |
| "Metric X dropped — investigate" | HEART | Engagement, Task Success (funnel), then Happiness |
| "Measure success of feature Y" | HEART | Adoption, Engagement, Retention (post-feature) |
| "Grow product Z in new market" | AARRR | Acquisition, Activation, Referral |
| "Should we add monetization?" | AARRR + guardrails | Revenue vs. Retention trade-off |
| "Define the North Star for X" | North Star | Value delivery, frequency, breadth of impact |
| "Design an A/B test for Y" | Experiment Design | Randomization, network effects, duration, MDE |

Network Effect Risks — Cheat Sheet

| Risk | What It Is | Mitigation |
| --- | --- | --- |
| Interference / Spillover | Treatment users interact with control users via the social graph | Cluster randomization by social-graph partition or geography |
| Novelty Effect | Engagement spike from excitement, not real lift | Run 2–4 weeks; analyze engagement by days-in-experiment |
| Primacy Effect | Users resist change initially, then adapt | Segment by days-in-experiment; look for behavior stabilization |
| Sample Ratio Mismatch | Groups aren't the expected size → logging bug or selection bias | Chi-square test on group sizes within 24h of launch (sketched below) |
| Multiple Testing | Many metrics → inflated false-positive rate | Pre-register the primary metric; Bonferroni correction for secondary metrics |
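
The SRM check is the cheapest of these mitigations to automate. A sketch using scipy's chi-square goodness-of-fit test, with hypothetical counts; the strict alarm threshold reflects the very large samples these checks run on:

# Python sketch: sample ratio mismatch check (hypothetical bucket counts).
from scipy.stats import chisquare

observed = [50_391, 49_609]   # users actually bucketed into control / treatment
expected = [50_000, 50_000]   # the 50/50 split we configured

stat, p = chisquare(f_obs=observed, f_exp=expected)
# alarm on a strict threshold; at these sample sizes even tiny biases are detectable
print(f"SRM alarm: {p < 0.001} (p = {p:.4f})")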