Recensa

Benchmark methodology

How Recensa evaluates multi-model document assurance: principles for whole-document tasks, severity rubrics, and honest partial-run reporting—and an internal harness that applies them across document types. A living framework, not a leaderboard.

Last updated 2026-05-14

How to read this page

Research map

What you can learn here

  • Task realism

    Whole-document behaviors—not toy sentences alone.
  • Rubrics & severity

    Material vs nit; meaning-preserving edits.
  • Evidence linkage

    Grounded anchors vs invented citations.
  • Disclosure

    Failure modes and partial runs—not headline accuracy only.

Technical detail

Benchmark design

What should be measured
Whole documents, severity, evidence, and operational stress.
  • Task realism: cross references, definitions, and section-level behavior—not isolated toy sentences.
  • Severity calibration: material vs nit, and whether suggested edits preserve meaning.
  • Evidence linkage: whether systems stay grounded when exhibits matter.
  • Operational stress: long inputs, partial provider failures, and honest partial outputs.
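
As an illustration of the severity-calibration item above, here is a minimal Python sketch of how reported severities might be compared against labelled ones. It is not Recensa's scoring code. The names `ScoredFinding` and `severity_report` are hypothetical, as is the two-level `material` / `nit` scale and the assumption that each finding has already been matched to a labelled defect.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoredFinding:
    """One labelled defect and what the system said about it (hypothetical shape)."""
    defect_id: str
    expected: str             # labelled severity: "material" or "nit"
    reported: Optional[str]   # severity the system reported, or None if missed


def severity_report(findings):
    """Tally expected-vs-reported severity and summarise calibration."""
    counts = Counter()
    for f in findings:
        outcome = f.reported if f.reported is not None else "missed"
        counts[(f.expected, outcome)] += 1

    material_total = sum(n for (exp, _), n in counts.items() if exp == "material")
    material_hit = counts[("material", "material")]
    return {
        "confusion": dict(counts),
        # Share of material defects reported as material; None if there were none.
        "material_recall": material_hit / material_total if material_total else None,
        # Nits escalated to material: a sign of over-flagging.
        "nits_escalated": counts[("nit", "material")],
    }
```

Keeping the full confusion table alongside the summary numbers makes it visible when a system misses material defects or inflates nits, rather than folding both into one accuracy figure.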
Why benchmark design matters
Raw scores hide prompt leakage, dataset overlap, and rubric gaming.

A credible benchmark states tasks, datasets, rubrics, and adjudication up front—and reports failure modes, not only headline accuracy. Enough detail should exist that a third party could attempt to reproduce the harness.

Recensa stance on evaluation
Internal harness practice; structured disclosure for anything published.

Recensa operates an internal evaluation harness that applies these principles as ongoing product methodology, not as a published third-party benchmark. Whole-document fixtures across legal, academic, business, and contract-style samples carry seeded defects with ground truth. Automated scoring runs against the same multi-model Document Check pipeline customers use (three independent reviewers on Claude, GPT, and Gemini, reconciled by an arbiter), with honest partial-run reporting when a provider is unavailable.
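
To make the idea of honest partial-run reporting concrete, the sketch below shows one way a run report could record which reviewers completed and which did not. It is an assumption-laden illustration, not the Document Check pipeline: `ReviewerResult`, `reconcile`, and the `min_quorum` parameter are hypothetical, and a simple vote count stands in for the arbiter, whose actual logic is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ReviewerResult:
    """Output of one reviewer (hypothetical shape)."""
    provider: str
    findings: Optional[Set[str]] = None  # None means the provider was unavailable
    error: Optional[str] = None


def reconcile(results, min_quorum=2):
    """Combine reviewer findings and state plainly which providers failed."""
    completed = [r for r in results if r.findings is not None]
    failed = [r.provider for r in results if r.findings is None]

    # Stand-in for the arbiter: count how many reviewers raised each finding.
    votes = Counter()
    for r in completed:
        votes.update(r.findings)

    if not failed:
        status = "complete"
    elif completed:
        status = "partial"
    else:
        status = "failed"

    return {
        "status": status,
        "providers_completed": [r.provider for r in completed],
        "providers_failed": failed,
        "agreed": sorted(f for f, n in votes.items() if n >= min_quorum),
        "disputed": sorted(f for f, n in votes.items() if n < min_quorum),
    }
```

In this sketch a run with one unavailable provider is labelled `partial` and still returns findings, so a reader of the report can tell a two-reviewer result from a three-reviewer one.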

Internal runs emphasize whole-document realism, severity calibration, evidence and citation boundaries, and failure-mode visibility. They inform product development; they do not constitute a neutral vendor comparison and are not published here as comparative scores.

Any external benchmark we publish should disclose what ran, what failed, and what was adjudicated. This page carries no comparative vendor scores.

What may ship later
Only after real, completed runs with disclosure.
  • Task mix results with confidence intervals—not single-point leaderboards.
  • Failure galleries where models disagreed or partial quorum applied.
  • Open prompts and scoring notes for replication attempts.
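
For the confidence-interval item above, one common approach is a percentile bootstrap over per-item outcomes. The sketch below assumes each fixture item is scored 1 (defect detected) or 0 (missed). The function name `bootstrap_ci` and its defaults are hypothetical, and nothing here represents a published Recensa result.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the mean of `outcomes`.

    Uses a percentile bootstrap: resample with replacement, recompute the
    mean each time, and read the interval off the sorted resampled means.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")

    rng = random.Random(seed)  # fixed seed so a rerun gives the same interval
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper
```

Reporting the interval alongside the point estimate shows how much a score on a small fixture set could move under resampling, which a single-point figure does not.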