
How evidence grounding works

Evidence grounding is the architectural difference between a synthetic persona that reasons like a real audience and one that improvises plausible-sounding answers.

In Candor, evidence grounding means every attribute on every persona either traces back to a real source (published research, your uploaded documents, or validated population distributions) or is explicitly flagged as inference or random sampling. The audit trail is visible to anyone reviewing the study. This piece walks through how the pipeline actually works.

If you want the opinionated version of why this matters, read "Why synthetic research needs evidence grounding." This piece is the technical companion: what the system does, step by step, and what the methodology gives you.

What evidence grounding means in practice

Most "AI persona" tools work the same way. The user describes an audience, and the AI invents a character that fits the description. The output sounds plausible because language models are pattern matchers, and examples of "a 35-year-old marketing director" are abundant enough in training data to produce a fluent character on demand. There is no link between the persona's attributes and any specific evidence about the actual audience.

Evidence grounding flips that. Before any persona is generated, Candor retrieves real evidence about the target audience: published research, peer-reviewed studies, market reports, any first-party documents the customer uploaded, and validated behavioral and demographic distributions. The personas are constructed from that evidence, not from a prompt alone. Each attribute carries a tag showing which source informed it and at what confidence level. When a persona answers an interview question, the underlying reasoning traces back to evidence that exists in the system, not to the AI's pattern-matched best guess.

The practical difference shows up in three places. First, the personas reason about pain points, beliefs, and behaviors in ways that reflect what the actual audience has been documented to say, not the AI's compressed average of every "marketing director" in training data. Second, the methodology is auditable: a UXR lead or insights director can inspect the provenance and push back where the evidence is thin. Third, when the audience evolves (new market entrants, shifting attitudes, new behavioral data), re-running the evidence stage produces personas that reflect the updated reality, not the same generic pattern.

The rest of this piece walks through the five stages that produce that result.

Stage 1: Evidence retrieval. Where the data comes from

The first stage of every Candor study is evidence retrieval. At study setup, the user describes the target audience: B2C or B2B context, the learning goals they want to address, the industry, and the region. They can also upload any first-party research they already have. The specific interview type is selected later, when the interview guide is generated. Value-prop testing and assumption validation are layered formats that sit on top of whichever interview type is chosen. From the initial audience input, Candor builds a search plan and runs two parallel retrieval workflows.

Uploaded documents take priority. If the customer has uploaded research (PDF, DOC, DOCX, CSV, TXT, or MD files: customer interview transcripts, survey data, market reports, prior research deliverables), Candor parses and indexes those documents. They become the primary evidence source. Internal research about your specific audience is more relevant than general public evidence about the broader category, so uploaded documents are weighted more heavily in retrieval.

The documents are parsed into chunks, embedded into a vector space, and stored in a searchable index. When the pipeline needs evidence about a specific dimension of the audience (their pain points, their decision criteria, their attitudes toward category X), it retrieves the relevant chunks via vector search.
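The chunk, embed, and search loop can be sketched in a few lines. This is a minimal illustration, not Candor's implementation. The "embedding" is a toy word-count vector standing in for a real embedding model, search is brute-force cosine similarity rather than a vector index, and the 50-word chunk size and the `UPLOAD_WEIGHT` boost for customer-uploaded chunks are hypothetical values chosen to mirror the weighting described above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical boost for customer-uploaded evidence; the real weighting is not documented.
UPLOAD_WEIGHT = 1.5


@dataclass
class Chunk:
    text: str
    source_id: str
    origin: str  # "uploaded" (customer document) or "public" (retrieved source)
    vector: Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_document(text: str, source_id: str, origin: str, size: int = 50) -> list:
    """Split a document into fixed-size word windows and embed each window."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), size):
        piece = " ".join(words[start:start + size])
        chunks.append(Chunk(piece, source_id, origin, embed(piece)))
    return chunks


def retrieve(index: list, query: str, k: int = 3) -> list:
    """Rank chunks by similarity to the query, boosting customer-uploaded evidence."""
    q = embed(query)
    scored = []
    for chunk in index:
        score = cosine(q, chunk.vector)
        if chunk.origin == "uploaded":
            score *= UPLOAD_WEIGHT
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Usage with two illustrative one-sentence "documents".
index = chunk_document("Members say the claims process is slow and confusing.", "interview-07", "uploaded")
index += chunk_document("Survey respondents rank claims process speed as a top concern.", "public-report-2", "public")
top = retrieve(index, "claims process frustration", k=1)
```

A production system would swap `embed` for a learned embedding model and the linear scan for an approximate-nearest-neighbor index. The ranking logic, similarity plus a source-type weight, stays the same shape.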

Web search runs alongside the documents. In parallel with the document layer, Candor runs structured web searches across published research, industry analyses, peer-reviewed studies, public consumer-health and market-behavior datasets, and any other source likely to contain evidence about the audience. The search strategy is iterative rather than batch: the pipeline runs a broad initial pass, then assesses what's been retrieved, identifies coverage gaps, and runs targeted follow-up searches to fill them.

The web layer is critical when the uploaded documents are thin or when the audience extends beyond what the customer has researched directly. For pre-launch products, healthcare populations that are hard to recruit, regulated industries, and any audience where first-party research is incomplete, the web layer is often the bulk of the evidence.

Both layers feed the same evidence pool, with provenance distinguishing them. When a persona attribute later traces back to "Source A," the provenance tag indicates whether Source A was a customer-uploaded document or a public source retrieved by Candor. This matters at the synthesis stage, where the customer can see which findings rest on their own data versus public evidence.
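One way to picture "provenance distinguishing them" is an evidence record that carries its origin, plus a helper that labels what a finding rests on. This is a hypothetical sketch: the `Evidence` record, the `"uploaded"`/`"public"` origin labels, and the four support labels are illustrative assumptions, not Candor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_id: str
    origin: str  # hypothetical labels: "uploaded" (customer) or "public" (retrieved)


def origin_breakdown(cited: list) -> dict:
    """Count the distinct sources behind a finding, split by origin."""
    counts = {"uploaded": 0, "public": 0}
    for evidence in set(cited):  # frozen dataclass is hashable, so duplicates collapse
        counts[evidence.origin] += 1
    return counts


def support_label(cited: list) -> str:
    """Say whether a finding rests on customer data, public evidence, or both."""
    counts = origin_breakdown(cited)
    if counts["uploaded"] and counts["public"]:
        return "mixed"
    if counts["uploaded"]:
        return "customer data"
    if counts["public"]:
        return "public evidence"
    return "unsupported"
```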

Stage 2: Signal extraction. Turning raw evidence into structured persona attributes

Retrieved evidence by itself is not yet usable. A PDF of a customer-interview transcript or a peer-reviewed paper about consumer behavior is unstructured text. To generate personas from it, Candor first extracts structured signals from the evidence.

A signal is a structured representation of one piece of behavioral or attitudinal information about the audience. Candor extracts hundreds of signals per study across several categories:

  • Behaviors: what the audience actually does (purchase patterns, app usage, communication preferences, decision-making sequences)
  • Pain points: what frustrates them, what they avoid, what they actively complain about
  • Attitudes: what they believe, what they're skeptical of, what they value
  • Constraints: what limits their choices (budget, time, regulatory, organizational, capability)
  • Goals: what they're trying to accomplish, both stated and implied
  • Beliefs: what they assume to be true about the category, the alternatives, the world
  • Preferences: what they prefer when given choices, including the reasoning behind the preference
  • Decision rules: how they actually decide, including shortcuts and heuristics

Each signal is extracted from a specific piece of evidence and tagged with metadata: which source it came from, which category it belongs to, what confidence level the extraction has, and what segment(s) of the audience it applies to.

The extraction is run by language models with structured-output validation, so the signals come out in a consistent schema rather than as free text. This matters because the next stages (segmentation, persona generation, interview behavior) all operate on the structured signal representation, not on raw evidence.
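Structured-output validation boils down to parsing each model response against a fixed schema and rejecting anything that doesn't fit. The sketch below is one assumed version of such a schema. The eight categories come from the list above. The field names, the 0.0 to 1.0 confidence scale, and the JSON input format are illustrative assumptions.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    BEHAVIOR = "behavior"
    PAIN_POINT = "pain_point"
    ATTITUDE = "attitude"
    CONSTRAINT = "constraint"
    GOAL = "goal"
    BELIEF = "belief"
    PREFERENCE = "preference"
    DECISION_RULE = "decision_rule"


@dataclass(frozen=True)
class Signal:
    statement: str
    category: Category
    source_id: str
    confidence: float      # assumed 0.0-1.0 scale
    segments: tuple        # audience segments the signal applies to


REQUIRED_FIELDS = {"statement", "category", "source_id", "confidence", "segments"}


def parse_signal(raw: str) -> Signal:
    """Validate one model-produced JSON object against the signal schema."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    category = Category(data["category"])  # raises ValueError for an unknown category
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if not data["segments"]:
        raise ValueError("a signal must apply to at least one segment")
    return Signal(data["statement"], category, data["source_id"], confidence, tuple(data["segments"]))
```

Rejected outputs can be retried or dropped. Either way, downstream stages only ever see records that match the schema.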

Signal extraction is also where the pipeline starts to see the shape of the audience: which segments cluster around similar pain points, which signals are universal versus segment-specific, where the evidence is dense versus thin. That shape feeds the next stage.
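A rough sketch of reading that shape from extracted signals follows. It assumes each signal has already been reduced to a `(theme, segment)` pair; the theme grouping is a hypothetical simplification. A theme counts as universal if it appears in every defined segment and as segment-specific if it appears in only one. Per-segment counts show where evidence is dense or thin.

```python
from collections import defaultdict


def audience_shape(signals, all_segments):
    """signals: iterable of (theme, segment) pairs (an assumed, simplified form)."""
    theme_segments = defaultdict(set)
    density = {segment: 0 for segment in all_segments}
    for theme, segment in signals:
        theme_segments[theme].add(segment)
        density[segment] = density.get(segment, 0) + 1
    universal = sorted(t for t, segs in theme_segments.items() if segs >= set(all_segments))
    specific = sorted(t for t, segs in theme_segments.items() if len(segs) == 1)
    return universal, specific, density
```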

Stage 3: Provenance tagging. The audit trail at the attribute level

This is the stage where Candor's approach to rigor becomes visible in the product surface. Every signal extracted from evidence (and every attribute later derived from those signals on individual personas) carries a provenance tag. There are five:

  1. Grounded: the attribute traces directly to a specific source in the evidence pool. The customer can click through to the underlying citation. This is the highest-confidence tag; the attribute reflects something that was documented in the audience's actual behavior or attitudes.
  2. Inferred: the attribute was extrapolated from a behavioral pattern documented in the evidence. The pattern is grounded; the specific extrapolation is the inference. This is common when the evidence supports a general behavior and a specific persona instance follows from it.
  3. Calibrated: the attribute was drawn from a validated statistical distribution (a peer-reviewed population distribution for personality traits, a research-backed range for a behavioral parameter). The persona's specific value is a sample; the distribution it was sampled from is calibrated.
  4. Sampled: the attribute was sampled at random because no specific evidence was available and the value doesn't depend on validated distributions. This is honest noise.
  5. Weak-confidence: the attribute is grounded or inferred, but the underlying evidence is sparse enough that the customer should treat it as a hypothesis worth real-customer validation rather than as a finding.
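The five tags can be read as a precedence order: check for the most specific evidence first and fall back from there. The sketch below assumes each attribute arrives with three yes/no facts and a count of supporting sources. `MIN_SUPPORTING_SOURCES` is a hypothetical threshold for "sparse". The downgrade to weak-confidence applies only to grounded and inferred attributes, as described above.

```python
from enum import Enum


class Provenance(Enum):
    GROUNDED = "grounded"
    INFERRED = "inferred"
    CALIBRATED = "calibrated"
    SAMPLED = "sampled"
    WEAK_CONFIDENCE = "weak-confidence"


# Hypothetical threshold: grounded or inferred attributes backed by fewer
# independent sources than this are downgraded to weak-confidence.
MIN_SUPPORTING_SOURCES = 2


def assign_tag(cites_source: bool, follows_documented_pattern: bool,
               from_validated_distribution: bool, supporting_sources: int) -> Provenance:
    """Pick a provenance tag for one attribute, most specific evidence first."""
    if cites_source:
        tag = Provenance.GROUNDED
    elif follows_documented_pattern:
        tag = Provenance.INFERRED
    elif from_validated_distribution:
        return Provenance.CALIBRATED
    else:
        return Provenance.SAMPLED  # no evidence applies: honest noise
    # Only grounded and inferred attributes can be downgraded for sparse evidence.
    if supporting_sources < MIN_SUPPORTING_SOURCES:
        return Provenance.WEAK_CONFIDENCE
    return tag
```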

The provenance tag is visible on every persona attribute and on every signal cited in the synthesis report. When a synthesis finding reads "Personas in Segment A consistently described claims-process friction as their top frustration," the customer can drill into the underlying signals, see which sources contributed, and see the confidence level of each contributing signal.
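The drill-down is a walk through two lookups: from a finding to its contributing signals, then from each signal to its source citation. This minimal sketch assumes simple dictionaries for both lookups. The record fields, IDs, and placeholder citation strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SignalRecord:
    signal_id: str
    tag: str                  # one of the five provenance tags
    source_id: Optional[str]  # None for sampled attributes, which have no source


def trace_finding(signal_ids, signals, citations):
    """Walk one synthesis finding down to its signals and their source citations."""
    chain = []
    for signal_id in signal_ids:
        record = signals[signal_id]
        citation = citations.get(record.source_id, "no source (not grounded)")
        chain.append((record.signal_id, record.tag, citation))
    return chain


# Illustrative fixtures only.
signals = {
    "sig-1": SignalRecord("sig-1", "grounded", "doc-3"),
    "sig-2": SignalRecord("sig-2", "weak-confidence", "web-12"),
    "sig-3": SignalRecord("sig-3", "sampled", None),
}
citations = {
    "doc-3": "uploaded: interview transcript (placeholder)",
    "web-12": "public: industry report (placeholder)",
}
chain = trace_finding(["sig-1", "sig-2", "sig-3"], signals, citations)
```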

This is the auditability that separates synthetic research from AI improvisation. A UXR lead reviewing a study can challenge the methodology at the attribute level, not just at the conclusion level. An insights director defending a launch decision to leadership can cite the provenance chain rather than relying on the team's word that the AI was "trained on the right data." A skeptical stakeholder asking "where did this finding come from" gets a real answer with a real source, not a wave at the model.

Stage 4: Iterative gap-fill. Closing the holes

Evidence retrieval doesn't return uniform coverage. Some audiences have rich published research and dense first-party data; others have sparse coverage in one segment or across one signal category. Candor's pipeline assesses coverage after the first retrieval pass and runs targeted follow-up searches to fill identified gaps.

The gap-fill iteration looks at:

  • Signal-category balance: are some categories (behaviors, pain points, attitudes, constraints, etc.) under-represented relative to others? If so, run searches biased toward the under-represented categories.
  • Segment coverage: are all the defined audience segments equally represented in the evidence pool? If one segment is thin, run searches targeted at that segment.
  • Diversity within signals: are the retrieved signals reflecting a diverse range of perspectives, or are they clustering around a single source's framing? If the latter, run searches to broaden the perspective base.
  • Recency: is recent evidence present, or is the retrieved material older than is useful for the question? If older, run searches biased toward recent sources.

Each gap-fill iteration runs targeted web searches against the identified gaps, extracts signals from the new evidence, and re-assesses coverage. The loop continues until either (a) the gaps are sufficiently closed, or (b) the search budget for the study is exhausted.
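The loop and its two stopping conditions can be sketched as follows. This version checks only two of the four coverage dimensions, category balance and segment coverage. Diversity and recency checks would be extra rules inside `find_gaps`. `MIN_PER_BUCKET` is a hypothetical threshold. `search_and_extract` is a hypothetical callable standing in for "run a targeted search and extract signals from the results."

```python
from collections import Counter
from typing import Callable

# Hypothetical minimum number of signals per category and per segment.
MIN_PER_BUCKET = 5


def find_gaps(signals, categories, segments):
    """signals: list of (category, segment) pairs. Returns under-covered buckets."""
    by_category = Counter(cat for cat, _ in signals)
    by_segment = Counter(seg for _, seg in signals)
    gaps = [("category", c) for c in categories if by_category[c] < MIN_PER_BUCKET]
    gaps += [("segment", s) for s in segments if by_segment[s] < MIN_PER_BUCKET]
    return gaps


def gap_fill(signals, categories, segments, search_and_extract: Callable, budget: int):
    """Search against gaps until coverage is adequate or the search budget runs out."""
    signals = list(signals)
    gaps = find_gaps(signals, categories, segments)
    while gaps and budget > 0:
        for gap in gaps:
            if budget == 0:
                break
            signals.extend(search_and_extract(gap))  # one targeted search per gap
            budget -= 1
        gaps = find_gaps(signals, categories, segments)  # re-assess coverage
    return signals, gaps, budget
```

The function returns the remaining gaps alongside the signals. That way, when the budget runs out first, the unresolved gaps can be passed on rather than silently dropped.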

This iterative pattern is why audience generation in Candor takes 25 to 35 minutes rather than seconds: the system is doing meaningful evidence work, not generating a persona from a prompt. The minutes are spent retrieving and extracting evidence, not waiting on slow infrastructure.

Stage 5: Critic review. Quality control before persona generation

Before the evidence pool is handed off to persona generation, a critic agent reviews the assembled evidence and signal set for quality and diversity. The critic is a separate AI agent with a structured rubric covering:

  • Diversity of perspectives: are multiple viewpoints represented, or is the evidence biased toward one framing of the audience?
  • Signal redundancy versus coverage: has the gap-fill iteration produced redundant signals (the same finding from too many sources, with diminishing marginal information) or genuine breadth?
  • Provenance health: is the evidence pool dominated by weak-confidence signals (which would produce hypothesis-grade personas rather than research-grade personas), or is there a healthy proportion of grounded and calibrated signals?
  • Segment differentiation: do the segments actually differentiate from each other on the signal evidence, or are they overlapping enough that the segmentation may not be meaningful for the study?

When the critic flags issues, the pipeline either runs additional retrieval to address them or returns the issues to the synthesis stage as caveats the customer should see in the final report. The critic isn't a binary pass/fail gate; it's a quality-signal layer that surfaces concerns the customer should know about before drawing conclusions.
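Because the critic emits caveats rather than a verdict, a rule-based approximation returns a list of concerns instead of a boolean. The sketch below covers three rubric items: source dominance, the weak-confidence share, and segment overlap (measured as Jaccard similarity of theme sets). Candor's critic is described as an AI agent, so this is a swapped-in deterministic stand-in. The thresholds and the dict fields are hypothetical, and contradiction detection is not attempted.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rubric thresholds.
MAX_SOURCE_SHARE = 0.4
MAX_WEAK_SHARE = 0.3
MAX_SEGMENT_OVERLAP = 0.8


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


def critic_review(signals):
    """signals: dicts with 'source_id', 'tag', 'segment', 'theme'. Returns caveats, not pass/fail."""
    if not signals:
        return ["evidence pool is empty"]
    caveats = []
    total = len(signals)
    top_source, top_count = Counter(s["source_id"] for s in signals).most_common(1)[0]
    if top_count / total > MAX_SOURCE_SHARE:
        caveats.append(f"source {top_source} supplies {top_count / total:.0%} of signals")
    weak = sum(1 for s in signals if s["tag"] == "weak-confidence")
    if weak / total > MAX_WEAK_SHARE:
        caveats.append(f"{weak / total:.0%} of signals are weak-confidence; personas are hypothesis-grade")
    themes_by_segment = {}
    for s in signals:
        themes_by_segment.setdefault(s["segment"], set()).add(s["theme"])
    for seg_a, seg_b in combinations(sorted(themes_by_segment), 2):
        overlap = jaccard(themes_by_segment[seg_a], themes_by_segment[seg_b])
        if overlap > MAX_SEGMENT_OVERLAP:
            caveats.append(f"segments {seg_a} and {seg_b} overlap {overlap:.0%} on themes")
    return caveats
```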

The critic step also catches a class of failure modes that can otherwise ship unnoticed: internal contradictions in the evidence pool, a single source dominating the signal pool, or a segmentation that doesn't correspond to differentiated signal patterns. These aren't catastrophic failures. They are quality issues that, surfaced honestly, let the customer decide whether to proceed, re-run with different inputs, or escalate the question to real-customer research.

What evidence grounding gives you (and what it still can't do)

The cumulative effect of these five stages is research with an audit trail. Specifically:

Personas that reason about the audience's real patterns. Pain points, beliefs, and decision rules in the synthetic interviews track to documented evidence about the audience, not to AI-pattern-matched stereotypes. This produces interview content that researchers familiar with the audience will recognize as authentic rather than dismiss as generic.

Findings that survive stakeholder scrutiny. When a synthesis report goes to a stage gate, the provenance chain is the answer to "where did this come from." Researchers can defend the methodology. Insights leads can challenge specific attributes rather than dismissing or accepting the whole study as a black box.

A methodology that adapts to new evidence. When the audience evolves, re-running the evidence stage produces personas grounded in the updated evidence, not in last year's snapshot. This is part of why synthetic research can model drift over time: the underlying evidence is reloadable.

A clear handoff to real-customer research. Synthetic research is at its best as a first-pass layer that sharpens what to study with real customers. The provenance trail makes this handoff concrete: the surviving hypotheses are documented, the evidence each one rests on is named, and the gaps the real research should address are explicit.

What evidence grounding still can't do:

It can't produce statistically bounded point estimates. The signals are qualitative; the personas reason about the audience; the synthesis finds themes and tensions. None of this produces "27% of customers prefer X, plus-or-minus 3pp at 95% confidence." That number requires real-respondent panel research at sufficient N.
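For a sense of what "sufficient N" means for that example, the standard sample-size formula for a proportion gives a worked estimate. It assumes a simple random sample and the normal approximation, with p = 0.27, a margin of error E = 0.03, and z = 1.96 for 95% confidence:

```latex
n \ge \frac{z_{0.975}^{2}\, p(1-p)}{E^{2}}
  = \frac{1.96^{2} \times 0.27 \times 0.73}{0.03^{2}}
  \approx \frac{0.7572}{0.0009}
  \approx 841.3
```

That is roughly 842 real respondents, before any adjustment for design effects or non-response. No number of synthetic interviews substitutes for that sampling frame.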

It can't substitute for real-customer ground truth. The evidence pool reflects what's been documented; real customers behave in ways that haven't been documented yet. Real-customer research surfaces the unexpected; synthetic research reflects the expected within evidence-supported bounds.

It can't make up for missing evidence. If the audience is genuinely under-researched (a brand-new category, a niche professional segment with no public coverage), evidence grounding only goes as far as the evidence does. The gap-fill iteration helps; it doesn't manufacture evidence that doesn't exist. In those cases, the critic surfaces the gap, and the responsible move is to lean harder on the customer's first-party data or treat the synthetic findings as hypothesis-grade.

It doesn't replace real human judgment about what the research means. Evidence grounding produces grounded findings. Strategic interpretation of those findings remains a human job. The teams getting the most value from Candor treat the platform as an evidence-acceleration layer and the strategic synthesis as their own work.

Common questions

How is an evidence-grounded persona different from a prompt-driven one?

The difference is structural, not stylistic. A prompt-driven persona has no link between its attributes and any specific evidence about the audience: the AI invents the character from training-data patterns. An evidence-grounded persona has every attribute tagged with the source that informed it, retrieved from real research documents (published or customer-uploaded) before generation. The audit trail is the difference. When a stakeholder asks where a finding came from, prompt-driven research has no answer; evidence-grounded research can show the source.

How long does audience generation take?

Audience generation in Candor takes roughly 25 to 35 minutes per study. That window covers the full evidence pipeline: document parsing, embedding, web search across multiple iterations of gap-fill, signal extraction, provenance tagging, and critic review. The time isn't slow infrastructure; it's the system doing meaningful evidence work. The user doesn't wait at the screen; the pipeline runs in the background and the user is notified when audience review is ready.

What evidence does Candor draw on?

Two layers. The first is customer-uploaded documents (PDF, DOC, DOCX, CSV, TXT, MD): first-party research, transcripts, survey data, and any existing research deliverables. In parallel, Candor retrieves public evidence: peer-reviewed studies, market research, industry analyses, public health and consumer behavior datasets, and recent reporting about the audience. Both layers feed the same evidence pool, with provenance distinguishing which findings rest on customer data versus public sources. For data privacy, the recommendation is to anonymize uploaded documents before submitting: strip PII, remove patient and member identifiers, and rely on de-identified research as input.
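As a first pass before upload, obvious identifiers can be masked with pattern matching. This is a deliberately crude sketch, not a de-identification method. It catches only emails, North American-style phone numbers, and a hypothetical `MBR-` member-ID format. Names, addresses, dates, and free-text identifiers need a proper de-identification process.

```python
import re

# Illustrative patterns only; the MBR- member-ID format is hypothetical.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "MEMBER_ID": re.compile(r"\bMBR-\d{6,}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```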

What do the provenance tags mean?

Grounded (attribute traces to a specific source in the evidence pool); inferred (extrapolated from a behavioral pattern documented in the evidence); calibrated (drawn from a validated statistical distribution, like a peer-reviewed population distribution); sampled (drawn at random because no specific evidence applies); weak-confidence (grounded or inferred but with sparse underlying evidence, so treat it as hypothesis-grade). Every persona attribute and every cited signal carries one of these tags, visible in the persona profile and the synthesis report.

Can reviewers see where each persona attribute came from?

Yes. Every attribute on every persona is tagged with its provenance, and grounded attributes cite the specific source they trace to. The audit trail is visible in the persona profile and in the synthesis report. UXR leads inspecting a study can drill from a synthesis finding into the underlying signals, and from those signals into the source citations.

What happens when the evidence about my audience is thin?

The pipeline's gap-fill iteration runs targeted searches to address identified gaps, and the critic agent surfaces remaining gaps before persona generation. If the evidence is genuinely thin (a brand-new category, a niche segment with no public coverage), Candor will say so rather than generate research-grade personas from sparse data. The personas will be flagged as weak-confidence on the affected dimensions, and the synthesis report will note the gap. The responsible path is to lean on the customer's first-party research, treat the findings as hypothesis-grade, or move the question to real-customer research.

Candor is in development.

Powered by Highline Beta.