
Why Contract Extraction Needs Offline Evaluation

A practical engineering note on why contract extraction should be measured with golden sets, field-level evaluation, evidence checks, error attribution, and version comparison instead of only tuning prompts.

Updated: Jun 4, 2026
Tags: LLM, Information Extraction, Evaluation, Contracts

The easiest mistake in contract extraction is to treat every failure as a prompt problem.

Prompts do matter. Contract fields such as fee schedules, share classes, subscription and redemption rules, investment strategy, and benchmarks need clear definitions and strict output formats. But in a real enterprise system, the prompt is only one part of the extraction pipeline. The harder question is whether each change can be measured.

Without offline evaluation, iteration usually becomes reactive. A contract fails, and someone adds another instruction to the prompt. Another sample fails, and another rule is added. A few days later, an older sample breaks. The team may be fixing visible examples, but it cannot answer the more important question: did this version actually improve the system, or did it just move errors around?

Contract Extraction Has Many Failure Modes

Contract extraction is not summarization. The output is a structured set of fields that often flows into review tools or operational systems.

Typical errors include:

  • Wrong value: management fee, custody fee, and sales service fee are mixed up.
  • Wrong scope: the system extracts Class A terms but misses Class C terms.
  • Wrong unit or interpretation: annual fees, one-time fees, and tiered fees are flattened into the same format.
  • Wrong evidence: the value looks right, but the cited clause does not support it.
  • Wrong missing-value decision: the contract does not state the field, but the model fills in a plausible value.
  • Wrong merge logic: the same field appears in the body, appendix, and supplemental clauses, but the final result picks the wrong source.

These issues do not have the same root cause. The problem may be PDF parsing, chunking, field definitions, retrieval, model behavior, or post-processing. If the system has no evaluation harness, all of these failures get collapsed into “the model is inaccurate.”

That is not a useful diagnosis.

A Golden Set Is An Engineering Baseline

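The first evaluation asset is a desensitized golden set: real contracts with client names and other identifying details removed or masked, labeled with the expected extraction results.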

It does not need to be large at the beginning. It does need to cover real edge cases: different contract formats, multiple share classes, fee expressions, repeated fields across the body and appendices, cross-page tables, missing fields, and fields that require conditional interpretation.

Each sample should include:

  • The expected field value, scope, unit, and nullability.
  • The evidence span or source location that supports the answer.
  • Annotation notes explaining why the answer is correct and what should not count as correct.
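
One way to make these three requirements concrete is to give every golden label a fixed shape. The sketch below is only an illustration: the class names, field names, and the sample values in the usage note are assumptions, not taken from any real contract or existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceSpan:
    page: int    # 1-based page number in the source document (assumed convention)
    quote: str   # verbatim text that supports the labeled value


@dataclass(frozen=True)
class GoldenField:
    name: str                      # hypothetical field key, e.g. "management_fee"
    value: Optional[str]           # None means the contract does not state the field
    scope: Optional[str] = None    # e.g. "Class A"; None means the whole contract
    unit: Optional[str] = None     # e.g. "percent_per_year"
    evidence: tuple[EvidenceSpan, ...] = ()
    notes: str = ""                # why this is correct, and what should not count


def validate(label: GoldenField) -> list[str]:
    """Return a list of labeling problems; an empty list means the label is usable."""
    problems = []
    if label.value is not None and not label.evidence:
        problems.append("stated value has no evidence span")
    if label.value is None and label.unit is not None:
        problems.append("null value should not carry a unit")
    return problems
```

Running `validate` over the whole golden set before each evaluation run catches value-only labels early, for example a `GoldenField(name="custody_fee", value="0.2")` with no evidence attached.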

Many teams only label values. That is faster at first, but it makes later debugging harder. A correct-looking value does not prove the extraction path was correct. The model may have guessed it from another clause or learned a common pattern from similar contracts.

Evidence validation changes the standard from “the answer seems plausible” to “the answer is supported by this contract.”

Field-Level Evaluation Beats A Single Document Score

A contract may contain many field types. Some are easy. Some are business-critical. A single pass/fail score for the whole document hides the details that engineers need.

Field-level evaluation is more useful:

  • Track accuracy, recall, and missing-value rate per field.
  • Apply stricter thresholds to high-risk fields.
  • Separate exact match, normalized match, partial match, and uncertain cases.
  • Compare multi-value fields as sets instead of raw strings.
  • Normalize rates, dates, and amounts before comparison.
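
As a rough sketch of per-field tracking, the function below counts five outcomes for each field and derives rates from them. The outcome names and the row format are assumptions for illustration. Precision is used here as the accuracy measure over the values the system actually produced, and values are assumed to be normalized already, with multi-value fields passed as `frozenset` so that order does not matter.

```python
from collections import defaultdict


def field_metrics(rows):
    """rows: iterable of (field_name, expected, predicted); None means 'not stated'."""
    counts = defaultdict(lambda: {
        "correct": 0, "wrong": 0, "missed": 0, "hallucinated": 0, "true_null": 0,
    })
    for name, expected, predicted in rows:
        c = counts[name]
        if expected is None and predicted is None:
            c["true_null"] += 1        # correctly left empty
        elif expected is None:
            c["hallucinated"] += 1     # filled a field the contract does not state
        elif predicted is None:
            c["missed"] += 1           # contract states it, system returned nothing
        elif expected == predicted:
            c["correct"] += 1
        else:
            c["wrong"] += 1

    report = {}
    for name, c in counts.items():
        produced = c["correct"] + c["wrong"] + c["hallucinated"]
        stated = c["correct"] + c["wrong"] + c["missed"]
        report[name] = {
            **c,
            "precision": c["correct"] / produced if produced else None,
            "recall": c["correct"] / stated if stated else None,
            "missing_rate": c["missed"] / stated if stated else None,
        }
    return report
```

Keeping the raw counts next to the rates matters on a small golden set, where a rate computed from three samples should not be read the same way as one computed from three hundred.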

For example, the same annual fee can be written as a percentage, spelled-out text, or clause-specific wording. Different fee categories may also use similar phrasing. The evaluation script has to understand these differences instead of relying only on string equality.
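
A minimal sketch of normalized comparison for rates, assuming only two written forms (percent and basis points). Spelled-out wording is deliberately left unparsed and reported as uncertain rather than guessed. The function names and match labels are illustrative.

```python
import re
from decimal import Decimal
from typing import Optional

_PERCENT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(%|percent|per cent)\s*$", re.IGNORECASE)
_BPS = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(bps|bp|basis points)\s*$", re.IGNORECASE)


def normalize_rate(text: str) -> Optional[Decimal]:
    """Return the rate as a fraction (1.5% becomes 0.015), or None if unparseable."""
    m = _PERCENT.match(text)
    if m:
        return Decimal(m.group(1)) / 100
    m = _BPS.match(text)
    if m:
        return Decimal(m.group(1)) / 10000
    return None


def compare_rate(expected: str, predicted: str) -> str:
    """Classify a pair as 'exact', 'normalized', 'mismatch', or 'uncertain'."""
    if expected.strip() == predicted.strip():
        return "exact"
    e, p = normalize_rate(expected), normalize_rate(predicted)
    if e is None or p is None:
        return "uncertain"   # route to a human instead of forcing a verdict
    return "normalized" if e == p else "mismatch"
```

`Decimal` is used instead of `float` so that values such as 1.5% and 150 bps compare equal exactly. This sketch only compares the number; telling a management fee from a custody fee with similar phrasing still depends on the field and evidence checks.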

Evidence Checks Reduce Hallucination

For contract extraction, I prefer making evidence a required part of the output schema.

If a field has no supporting evidence, the system should lower confidence or route it to manual review, even when the value looks reasonable. Evidence checks should answer three questions:

  • Did the evidence come from the current contract?
  • Does the evidence contain or support the extracted value?
  • Does the evidence support the right field, rather than a nearby or similar clause?

This is more than explainability for its own sake. It exposes errors that value-only evaluation can miss. A model may extract the right number while citing the wrong clause. Or it may cite a custody fee clause while filling the management fee field.

It also makes human review faster. Reviewers can inspect the proposed source directly instead of searching through the full document again.
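
The three evidence questions can be approximated with simple string checks before any model-based judging is added. The sketch below is a crude heuristic under stated assumptions: the cue phrases in `FIELD_CUES` and the field keys are hypothetical, and substring matching will not catch paraphrased support.

```python
import re


def _squash(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


# Hypothetical cue phrases per field; in practice these would come from annotators.
FIELD_CUES = {
    "management_fee": ("management fee",),
    "custody_fee": ("custody fee", "custodian fee"),
}


def check_evidence(field: str, value: str, quote: str, contract_text: str) -> dict:
    """Answer the three evidence questions with plain substring checks."""
    q = _squash(quote)
    v = _squash(value)
    return {
        # 1. Did the evidence come from the current contract?
        "in_contract": bool(q) and q in _squash(contract_text),
        # 2. Does the evidence contain the extracted value?
        "supports_value": bool(v) and v in q,
        # 3. Does the evidence talk about the right field?
        "right_field": any(cue in q for cue in FIELD_CUES.get(field, ())),
    }
```

With this check, a management fee filled from a quote such as "The custody fee is 0.20% per annum." passes the first two questions and fails the third, which is exactly the case value-only scoring misses.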

Error Attribution Tells You What To Fix

Offline evaluation should not stop at scoring. The useful part is attribution.

The same incorrect field can require very different fixes:

  • Parsing error: PDF text is missing, table order is wrong, or headers and footers pollute the content.
  • Localization (retrieval) error: the relevant clause never reached the model context.
  • Field definition error: annotation rules and prompt instructions disagree.
  • Model judgment error: the right context is present, but the model chooses the wrong value.
  • Post-processing error: deduplication, merging, unit conversion, or formatting changes the answer incorrectly.
  • Golden label error: the expected answer is wrong and the annotation should be fixed.

Without attribution, teams keep adding prompt rules for problems that should be solved in parsing, chunking, post-processing, or annotation.
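
One possible way to apply these categories is to record a few yes/no signals for each failed field and walk them in pipeline order, so the earliest broken stage gets the blame. This is a sketch of a triage order, not a definitive method: the signal names are assumptions, and some of them (such as whether the label was re-checked) have to come from a human reviewer.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    GOLDEN_LABEL = "golden_label"
    PARSING = "parsing"
    LOCALIZATION = "localization"
    POST_PROCESSING = "post_processing"
    FIELD_DEFINITION = "field_definition"
    MODEL_JUDGMENT = "model_judgment"


@dataclass(frozen=True)
class FailureTrace:
    """Signals collected for one field whose final value did not match the label."""
    label_confirmed: bool                  # a reviewer re-checked the golden label
    evidence_in_parsed_text: bool          # the clause survived PDF parsing
    evidence_in_model_context: bool        # the clause was in the prompt context
    raw_model_value_correct: bool          # value was right before post-processing
    prompt_matches_annotation_rules: bool  # prompt and labeling guide agree


def attribute(trace: FailureTrace) -> Cause:
    if not trace.label_confirmed:
        return Cause.GOLDEN_LABEL
    if not trace.evidence_in_parsed_text:
        return Cause.PARSING
    if not trace.evidence_in_model_context:
        return Cause.LOCALIZATION
    if trace.raw_model_value_correct:
        return Cause.POST_PROCESSING   # model was right, a later step broke it
    if not trace.prompt_matches_annotation_rules:
        return Cause.FIELD_DEFINITION
    return Cause.MODEL_JUDGMENT
```

Only the last branch points at the model itself. Counting causes across a run shows whether prompt work is even the right place to spend time.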

Version Comparison Matters More Than One Score

Contract extraction systems change over time. Teams adjust chunking, add fields, revise prompts, replace parsing logic, update merge rules, or switch model providers. Every change can create regressions.

Evaluation should support version comparison:

  • Which fields improved in this version?
  • Which previously correct fields regressed?
  • Are regressions concentrated in one contract type or expression pattern?
  • Are high-risk fields still stable?
  • Did the change increase cost or latency beyond an acceptable range?

This changes the engineering conversation. Instead of saying “the new prompt feels better,” the team can say “this version fixed several historical fee extraction errors, but introduced regressions for multi-share-class contracts because the merge step flattened class-specific values.”

That is the level of feedback needed for reliable iteration.
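
A version comparison of this kind can start as a plain diff between two evaluation runs. The sketch below assumes each run is stored as a mapping from `(contract_id, field_name)` to a pass/fail result; that storage format, the function name, and the `high_risk` set are illustrative assumptions. Cost and latency are not covered here.

```python
from collections import Counter


def diff_versions(baseline, candidate, high_risk=frozenset()):
    """Compare two runs keyed by (contract_id, field_name) -> bool (field correct)."""
    fixed, regressed = [], []
    # Only compare pairs present in both runs, so newly added fields do not skew the diff.
    for key in sorted(baseline.keys() & candidate.keys()):
        if not baseline[key] and candidate[key]:
            fixed.append(key)
        elif baseline[key] and not candidate[key]:
            regressed.append(key)
    return {
        "fixed": fixed,
        "regressed": regressed,
        "regressed_by_field": Counter(field for _, field in regressed),
        # Any regression on a high-risk field blocks the release in this sketch.
        "blocked": any(field in high_risk for _, field in regressed),
    }
```

Grouping regressions by field is the simplest cut. The same counter keyed by contract type or expression pattern would show whether regressions cluster, provided the golden set records those attributes.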

Prompts Still Matter, But They Need Guardrails

Prompt design is still part of the work. Field definitions, counterexamples, output schemas, missing-value rules, and evidence requirements should be explicit.

But prompt changes should happen inside an evaluation loop. After each change, the team should see field-level metrics, evidence results, attributed errors, and version diffs. Otherwise, the process is driven by intuition instead of engineering feedback.

For high-value contract extraction, the iteration loop I trust looks like this:

  1. Build a desensitized golden set.
  2. Define expected values and evidence rules for each field.
  3. Run field-level evaluation, not just document-level checks.
  4. Attribute errors to parsing, localization, model behavior, post-processing, or labels.
  5. Compare versions and make regressions visible.

This is slower than editing a prompt in the moment, but it makes the system more stable over time. Without it, contract extraction can look good in a demo while still leaving reviewers unsure which fields they can trust.

If you are evaluating an enterprise RAG, knowledge base, AI support, or agent workflow project, contact me by email at contact@aildnc.com. You can also reach me on Telegram at @NieErAI.
