Fund Contract Extraction and Offline Evaluation Pipeline | Nie Er

This project was built as an internal engineering pipeline, not as a public SaaS product or a prompt-only prototype. The input is a private fund contract PDF. The output is a set of reviewable structured fields, source evidence, and conflict markers, so operators can verify extracted results instead of reading the full contract from scratch.

Problem

Private fund contracts are long, inconsistent, and dense with business rules. The target output is not a generic summary. It is a set of fields that can support downstream review and data entry, such as share-class structure, fee terms, investment strategy, benchmark references, and subscription or redemption rules.

The hard parts are mostly operational:

Contracts do not follow one fixed template, and the same field may appear under different labels.
Important details can be spread across the main body, appendices, share-class descriptions, and nearby tables.
Numbers, dates, fee rates, and share classes require careful handling because small mistakes can change the business meaning.
Review teams need traceability, so every important field should point back to supporting contract text.

Stack

The pipeline is implemented mainly in Python. PyMuPDF is used for PDF text and layout extraction. LLM calls handle the field extraction work, while the surrounding engineering focuses on chunking, prompt variants, result merging, normalization, evidence tracing, and offline evaluation.

There is no public repository or hosted demo for this project, so the frontmatter intentionally omits repoUrl and demoUrl. Only the public GitHub profile is available.

Architecture

The pipeline is split into several modules:

PDF parsing: Extract text, page numbers, and basic layout signals from contract PDFs. Since these documents are mostly text-heavy contracts, a lightweight parser is preferred over a heavier OCR-first workflow.
Multi-window chunking: Split the contract with several page-window sizes and overlapping pages. This helps preserve cross-page tables, late-section clauses, and appendix-based field definitions.
Field extraction: Group fields by business topic and ask the model to extract structured values only from the supplied text range. The response includes the field value, page reference, and supporting source text. Sensitive fields can be processed by two prompt variants for cross-checking.
Fusion and cleanup: Merge candidates from different chunks and prompt variants. The cleanup layer handles duplicates, unit normalization, format hallucinations, and conflicting numeric values. Ambiguous results are kept as conflicts instead of being silently overwritten.
Offline evaluation: A de-identified golden set is used to compare extraction output with human labels. Evaluation is done at field level, so prompt, chunking, and post-processing changes can be measured before release.
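As an illustration of the unit normalization step in the cleanup layer, here is a minimal sketch for fee rates. The function name, the accepted units, and the rule that a bare number is already a fraction are all assumptions made for this example, not the pipeline's actual rules.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical unit table: multiplier that converts each unit to a plain fraction.
_UNIT_FACTORS = {
    "%": Decimal("0.01"),
    "bps": Decimal("0.0001"),
    "bp": Decimal("0.0001"),
}

# A number, optionally followed by one of the units above, and nothing else.
_RATE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(%|bps|bp)?\s*$", re.IGNORECASE)


def normalize_rate(raw: str) -> Optional[Decimal]:
    """Convert a fee-rate string to a fraction, or None if it does not parse."""
    match = _RATE_RE.match(raw)
    if match is None:
        # Leave unparseable values for review instead of guessing.
        return None
    number, unit = match.groups()
    # Assumption: a bare number is already expressed as a fraction.
    factor = _UNIT_FACTORS[unit.lower()] if unit else Decimal("1")
    return Decimal(number) * factor


print(normalize_rate("1.5%") == normalize_rate("150 bps"))  # True
print(normalize_rate("about 1.5%"))  # None
```

Using Decimal rather than float keeps equal rates written in different units comparable, which matters when the merge step decides whether two candidates agree.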

My Role

I worked on the extraction pipeline design and implementation, including:

PDF parsing and multi-window chunking strategy.
Prompt design for structured field extraction.
Candidate merging, deduplication, unit normalization, and conflict marking.
Source evidence tracing for important fields.
Offline evaluation scripts for comparing pipeline versions.
Iteration of prompts, chunk sizes, and post-processing rules based on evaluation output.

Engineering Challenges

The full contract cannot reliably fit into one model call.
Instead of truncating the document, the pipeline uses multiple overlapping page windows. This improves recall, keeps each model call bounded, and reduces the chance of breaking cross-page context.
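A minimal sketch of overlapping page windows might look like the following. The function names and the example window sizes are hypothetical; the source does not state the actual sizes or overlaps used.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # inclusive, 1-based (start_page, end_page)


def page_windows(num_pages: int, size: int, overlap: int) -> List[Window]:
    """Return overlapping page ranges of one window size covering the document."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows: List[Window] = []
    start = 1
    while start <= num_pages:
        end = min(start + size - 1, num_pages)
        windows.append((start, end))
        if end == num_pages:
            break  # the last page is covered, so stop
        start += step
    return windows


def multi_window_plan(num_pages: int, configs: Iterable[Tuple[int, int]]) -> List[Window]:
    """Combine several (size, overlap) settings, dropping duplicate ranges."""
    plan: List[Window] = []
    for size, overlap in configs:
        for window in page_windows(num_pages, size, overlap):
            if window not in plan:
                plan.append(window)
    return plan


print(page_windows(10, 4, 1))  # [(1, 4), (4, 7), (7, 10)]
```

Each window shares its last page with the start of the next one, so a table that crosses a page boundary appears whole in at least one window whenever it spans no more pages than the overlap allows.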

One field may have several valid-looking candidates.
For example, fee terms can differ by share class, and appendix language can refine or override the main body. The pipeline preserves the field, class, source location, and conflict state before applying merge rules or sending the item for human review.
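One way to keep competing candidates visible is to group them by field and share class and flag any group with more than one distinct value. The sketch below assumes values are already normalized strings; the class names, field names, and sample data are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    field: str
    share_class: str
    value: str      # assumed to be normalized already
    page: int       # page where the evidence was found
    evidence: str   # supporting contract text


@dataclass
class MergedField:
    field: str
    share_class: str
    values: List[str]             # distinct values, in first-seen order
    candidates: List[Candidate]   # every candidate is kept for review

    @property
    def conflict(self) -> bool:
        return len(self.values) > 1


def merge_candidates(candidates: List[Candidate]) -> List[MergedField]:
    """Group candidates per (field, share class) without discarding any of them."""
    grouped: Dict[Tuple[str, str], List[Candidate]] = defaultdict(list)
    for cand in candidates:
        grouped[(cand.field, cand.share_class)].append(cand)
    merged = []
    for (field, share_class), group in grouped.items():
        values: List[str] = []
        for cand in group:
            if cand.value not in values:
                values.append(cand.value)
        merged.append(MergedField(field, share_class, values, group))
    return merged


sample = [
    Candidate("management_fee", "A", "0.015", 12, "made-up main body clause"),
    Candidate("management_fee", "A", "0.015", 80, "made-up appendix clause"),
    Candidate("management_fee", "B", "0.01", 13, "made-up main body clause"),
    Candidate("management_fee", "B", "0.012", 95, "made-up appendix clause"),
]
for item in merge_candidates(sample):
    print(item.field, item.share_class, item.values, item.conflict)
```

Nothing is overwritten here: a later rule, such as letting appendix language take precedence, or a human reviewer can resolve the flagged groups with every source location still attached.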

A well-formatted answer is not enough.
The extraction output must be tied to source evidence. For high-risk fields, two prompt styles are compared: agreement increases confidence, while disagreement becomes an explicit conflict.
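The two checks described here can be sketched roughly as follows. Both helpers are hypothetical: the evidence check assumes the model returns a verbatim quote, and the cross-check treats a value found by only one prompt variant as a conflict, which is a design assumption rather than a documented rule.

```python
import re
from typing import Optional


def _squash(text: str) -> str:
    """Collapse whitespace so PDF line breaks do not defeat the match."""
    return re.sub(r"\s+", " ", text).strip()


def evidence_supported(evidence: str, page_text: str) -> bool:
    """True if the quoted evidence literally appears in the page text."""
    quote = _squash(evidence)
    return bool(quote) and quote in _squash(page_text)


def cross_check(value_a: Optional[str], value_b: Optional[str]) -> str:
    """Compare the normalized outputs of two prompt variants for one field."""
    if value_a is None and value_b is None:
        return "missing"
    if value_a == value_b:
        return "agreed"
    # Assumption: a value found by only one variant also counts as a conflict.
    return "conflict"


page = "The management fee is\n1.5% per annum."
print(evidence_supported("management fee is 1.5% per annum", page))  # True
print(cross_check("0.015", "0.02"))  # conflict
```

A literal match is deliberately strict: a quote the model paraphrased or invented fails the check and can be routed to review instead of being trusted because it looks well formed.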

Evaluation needs more than one aggregate score.
Field importance and failure modes vary. The offline evaluator separates issues such as labeling problems, parsing failures, missed retrieval, field interpretation errors, and normalization mistakes.

Offline Evaluation

The evaluation workflow uses de-identified contract samples and human-labeled expected outputs. Each run checks whether a field was found, whether the extracted value matches or is acceptably equivalent, and whether the evidence location supports the answer.

This makes iteration less subjective. A prompt change might improve fee recall but introduce more share-class conflicts. The evaluator exposes that trade-off at field level instead of relying on manual spot checks.
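A field-level comparison of this kind could be sketched as below. The outcome labels, function names, and sample data are made up for illustration, and exact string equality stands in for whatever "acceptably equivalent" means in the real evaluator.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Fields = Dict[str, Optional[str]]


def score_document(gold: Fields, predicted: Fields) -> Dict[str, str]:
    """Label each field of one document as match, mismatch, missed, or spurious."""
    outcomes: Dict[str, str] = {}
    for field in set(gold) | set(predicted):
        expected, actual = gold.get(field), predicted.get(field)
        if expected is None and actual is None:
            continue  # nothing labeled and nothing extracted
        if actual is None:
            outcomes[field] = "missed"
        elif expected is None:
            outcomes[field] = "spurious"
        elif expected == actual:
            outcomes[field] = "match"
        else:
            outcomes[field] = "mismatch"
    return outcomes


def per_field_report(documents: Iterable[Tuple[Fields, Fields]]) -> Dict[str, Counter]:
    """Count outcomes per field across all (gold, predicted) pairs."""
    report: Dict[str, Counter] = defaultdict(Counter)
    for gold, predicted in documents:
        for field, outcome in score_document(gold, predicted).items():
            report[field][outcome] += 1
    return report


docs = [
    ({"management_fee": "0.015", "benchmark": "Index X"},
     {"management_fee": "0.015", "benchmark": None}),
    ({"management_fee": "0.02"},
     {"management_fee": "0.015", "lockup": "12m"}),
]
for field, counts in sorted(per_field_report(docs).items()):
    print(field, dict(counts))
```

Running the same report for two pipeline versions and comparing the counters field by field is what surfaces a trade-off such as better fee recall paired with more share-class mismatches.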

Delivery

The delivered artifact is a backend extraction workflow that can be integrated into an internal business system. A single 80–150 page contract (~100k characters) goes from upload to a full set of fields in about 2–3 minutes, with extraction covering 13+ core field types; high-risk modules are cross-checked with two prompt variants at roughly 1.6× the single-prompt token cost. Typical outputs include structured fields, source evidence, confidence and conflict markers, review-needed items, processing logs, and evaluation reports.

If you are evaluating an enterprise RAG, knowledge base, AI support, or agent workflow project, contact me by email at contact@aildnc.com. You can also reach me on Telegram at @NieErAI.
