Abhishek Gawde

Two Ways to Feed a Document to a Language Model

Abhishek Gawde — Fri, 08 May 2026 19:47:01 GMT

The question of how much of a document a language model should see is deceptively simple. It turns out to be one of the more consequential decisions in an extraction pipeline, and the tradeoffs run in both directions.

There are two approaches. You can retrieve a small number of targeted chunks and send those. Or you can send the whole document. Both work in some situations. Both fail in others.

Chunk-and-retrieve

The document is split into chunks at indexing time. Each chunk covers one clause, one section, one table. When an extraction job runs, it constructs a query and retrieves the chunks that are most relevant to what it is looking for. The model sees those chunks, not the whole document.

The appeal is focus. The model processes a small, targeted input. Cost scales with the number of chunks retrieved rather than total document length. Attention is concentrated on the right text rather than spread across hundreds of pages.

The limitation is recall. Retrieval has a ceiling. A Snowflake study on financial filings found that retrieval and chunking strategy were larger determinants of answer quality than the generating model itself, and that chunk size and top-k choices had material effects on what the model could find. If the relevant clause ranks below whatever cutoff the retrieval step uses, it is not in the input. The model cannot extract something it was never shown.
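To make the cutoff concrete, here is a minimal sketch of top-k retrieval. The vectors, the chunk ids, and the `top_k` helper are invented for illustration; a real pipeline would use an embedding model and a vector store rather than hand-written 2-d vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]

# Toy vectors, made up for the example.
chunks = [
    {"id": "clause-2.1", "vec": [1.0, 0.0]},
    {"id": "clause-9.4", "vec": [0.9, 0.1]},
    {"id": "clause-14.3", "vec": [0.6, 0.8]},  # the clause we actually need
]
query = [1.0, 0.05]

retrieved = top_k(query, chunks, k=2)
missed = "clause-14.3" not in retrieved  # True: it ranked third, below the cutoff
```

With `k=3` the same clause would have been included, which is the sense in which the cutoff, not the model, decides what can be extracted.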

Full-document extraction

Skip retrieval. Send the whole document. The model reads everything and extracts from the complete text.

This removes the recall ceiling. Every clause is present. Nothing is excluded by retrieval ranking.

The problem is attention. A well-known 2024 paper by Liu et al., published in the Transactions of the Association for Computational Linguistics, examined how language models use long contexts across multi-document question answering and key-value retrieval tasks. It found that performance varies significantly depending on where relevant information sits in the context: models tend to use information near the beginning and end of long inputs more effectively than information in the middle. The effect is U-shaped. A relevant clause on page 40 of a 100-page document is less likely to be used than the same clause on page 3.

This is sometimes called the “lost in the middle” problem. It is not a bug so much as a structural pattern in how attention distributes across long sequences, with connections to serial position effects studied in human memory research.

The practical implication: for short documents, full-document extraction works well. Everything fits comfortably in context and the positional effects are less severe. For long documents, the results are less predictable. Facts in the middle of a 150-page contract are at higher risk of being missed than facts near the start or end, regardless of whether the model technically has access to them.

Full-document extraction also creates a provenance challenge. With chunk retrieval, every extracted fact traces back to a specific chunk with a clause number and page reference, because the chunk metadata carries that information. With full-document extraction, the model generates citations itself. Those citations are sometimes correct, sometimes not. For systems where precise provenance on every extracted fact is a hard requirement, this is a real constraint.

The tradeoff is not one-dimensional

An arXiv paper from 2024 re-examining RAG in the era of long-context models found something worth noting: with an order-preserving retrieval mechanism, RAG using only 16K retrieved tokens outperformed models using full 128K context on benchmark tasks. Long-context capability did not eliminate the value of retrieval. It shifted the question from whether to retrieve to how to retrieve.

The same paper found that the order in which retrieved chunks are presented matters. Preserving the original document order of retrieved chunks, rather than ordering them by relevance score, improved extraction quality. The relevance-ranked order that retrieval systems default to is not necessarily the order that helps the model most.
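A small sketch of what order preservation can look like in code. The `position` field is an assumption: it stands for whatever offset or sequence number the chunker recorded at indexing time, and the scores are invented.

```python
def order_for_prompt(retrieved):
    """Re-sort retrieved chunks by their position in the source document."""
    return sorted(retrieved, key=lambda c: c["position"])

# Relevance-ranked output from a hypothetical retriever (best score first).
retrieved = [
    {"id": "clause-14.3", "position": 212, "score": 0.91},
    {"id": "definitions-1.1", "position": 4, "score": 0.84},
    {"id": "schedule-4", "position": 388, "score": 0.80},
]

# Chunks are still chosen by score, but presented in document order.
prompt_order = [c["id"] for c in order_for_prompt(retrieved)]
```

Selection and presentation are separate steps here: the score decides which chunks make the cut, and the position decides the order the model reads them in.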

Separately, research on chunking strategies consistently finds that the boundaries matter. Semantic chunking, which splits on meaning rather than token count, tends to outperform fixed-size chunking for complex documents. Legal documents in particular have a specific challenge: a clause may be short but depend on a definition twenty pages earlier, or a qualifier in an annex. Splitting on clause boundaries preserves the legal unit of meaning. Splitting on token count can break a clause mid-sentence.
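The difference between the two splitting strategies is easy to see on a toy example. This sketch assumes clauses are numbered in a "14.1" style and uses character counts as a stand-in for token counts; the sample text is invented, and real contracts need a more careful boundary detector.

```python
import re

TEXT = (
    "14.1 The Contractor shall complete the Works by the Completion Date. "
    "14.2 The Employer may extend the Completion Date by written notice. "
    "14.3 Liquidated damages accrue daily until completion is achieved."
)

def fixed_size_chunks(text, size):
    """Split every `size` characters, ignoring clause boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def clause_chunks(text):
    """Split immediately before each clause number such as '14.2 '."""
    parts = re.split(r"(?=\b\d+\.\d+\s)", text)
    return [p.strip() for p in parts if p.strip()]

by_size = fixed_size_chunks(TEXT, 80)   # first chunk runs into clause 14.2
by_clause = clause_chunks(TEXT)         # one chunk per numbered clause
```

The fixed-size split ends its first chunk partway through clause 14.2, while the clause-based split keeps each numbered clause whole. Neither handles the harder case of a definition twenty pages away.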

The top-k value, how many chunks to retrieve per job, also has non-obvious consequences. Too low and recall suffers. Too high and the model’s attention gets diluted across irrelevant material, potentially reintroducing the same positional effects that full-document extraction suffers from.

What this means in practice

The choice between chunk-and-retrieve and full-document extraction is not a one-time architectural decision. It depends on document length, on which facts are being extracted, on how strict the provenance requirements are, and on what actually fails in practice.

For long, complex documents with a fixed extraction schema and a provenance requirement, chunk-and-retrieve is the more tractable starting point. The failure mode (retrieval missing the right clause) is diagnosable: you can log which chunks were retrieved for each extraction call and check whether the relevant clause was in the input. That is a concrete thing to inspect. The failure mode for full-document extraction on long documents (attention simply not reaching a clause in the middle) is harder to observe and harder to fix without changing the document or the model.

For short documents, or for extractions that require understanding how multiple clauses interact across a long document, the case for sending more context is stronger. The recall ceiling of retrieval becomes a more active constraint when cross-clause dependencies are common.

A hybrid approach that researchers and practitioners have explored is escalation: run chunk retrieval first, and escalate to full-document only when retrieval confidence is low and the document is short enough to fit cleanly in context. Whether this works better than either approach alone depends on the document distribution, and the results reported in the literature vary enough that it is not safe to assume the answer in advance.

The thing worth logging

The two failure modes look the same from outside the pipeline. An empty field or a low-confidence result. The cause is different.

A retrieval miss means the relevant chunk was not in the top-k. The model did its job on the input it received. The problem was upstream.

An attention miss means the relevant chunk was in the input but the model did not extract from it. The model had access to the information and did not use it effectively.

Finding which one you have requires looking at which chunks were actually retrieved for each failed extraction. Without that log, the two failure modes look identical even though they call for different fixes. It is the single most useful thing to instrument before trying to fix extraction quality.
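A minimal sketch of how such a log can be used, assuming each extraction call records the ids of the chunks it sent and that a reviewer can point to the chunk that contains the answer. The function name and the labels are hypothetical.

```python
def classify_failure(retrieved_ids, relevant_id, extracted_value):
    """Label one extraction call using the retrieval log.

    retrieved_ids: chunk ids that were sent to the model for this call
    relevant_id: id of the chunk a reviewer says contains the answer
    extracted_value: what the model returned for the field (None if empty)
    """
    if extracted_value is not None:
        return "ok"  # simplification: non-empty is treated as success here
    if relevant_id not in retrieved_ids:
        return "retrieval_miss"  # upstream problem: chunk never reached the model
    return "attention_miss"      # chunk was in the input but was not used


label = classify_failure(["c1", "c2"], relevant_id="c9", extracted_value=None)
```

A retrieval miss points at chunking, queries, or top-k; an attention miss points at prompt, input size, or model.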

I write about building document intelligence systems. If this was useful, follow along for the next piece.

How to Stop the LLM From Returning Whatever It Wants

Abhishek Gawde — Thu, 07 May 2026 19:37:53 GMT

Early in the pipeline I spent time writing validation logic to clean up what the LLM returned. Wrong types, missing fields, numbers formatted as strings. At some point I realised the validation was fixing a problem that should not have existed.

When you send a language model an extraction prompt, you get back text. The text might look like JSON. It might even be valid JSON. Whether it actually matches the schema you need is a separate question.

There are four approaches to closing this gap. They are not interchangeable, and they fail in different ways.

Free-form + post-processing

The LLM returns text. Your code parses it after the fact.

This is the default if you do nothing special. The model returns JSON, mostly. Occasionally it adds a sentence before it. Occasionally a date field comes back as “1st January 2024” instead of “2024-01-01”. Your parser handles the common cases. The uncommon ones break silently or surface as exceptions much later.

The failure mode is that every edge case is invisible until it occurs. Across thousands of documents with dozens of extraction jobs, there will be cases you have not seen. They will find their way in eventually.
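For illustration, a sketch of the kind of post-processing this approach leads to. The field names, the sample reply, and the list of date formats are assumptions; the point is that the parser only knows the formats someone has already seen.

```python
import json
from datetime import datetime

def parse_reply(text):
    """Pull the first {...} span out of a reply that may include extra prose."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])

def to_iso_date(value):
    """Normalise a few known date formats; raise on anything unseen."""
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

reply = 'Here is the result:\n{"party": "Acme GmbH", "effective_date": "1 January 2024"}'
record = parse_reply(reply)
record["effective_date"] = to_iso_date(record["effective_date"])
```

A value like "1st January 2024" matches none of the listed formats and raises, which is the better outcome. The worse one is a parser that quietly guesses.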

JSON mode

A lighter constraint available in most model APIs. The model is told to return valid JSON only: no preamble, no markdown fences.

This solves one problem: you reliably get parseable JSON. It does not solve the schema problem. The model can return valid JSON that does not match your schema: wrong field names, wrong types, missing required fields. A date might come back in one format on some documents and a different format on others. A required field might be missing if the model decided the document did not contain that information.

JSON mode reduces noise in post-processing but does not eliminate it. For quick prototyping, it is a reasonable starting point. For a production pipeline writing to a database, it leaves too much variability in place.
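A sketch of the checks that remain necessary after JSON mode. The schema, the field names, and the sample reply are hypothetical, and the check is deliberately simple (flat fields only).

```python
import json
import re

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "party": (str, True),
    "amount": (float, True),
    "due_date": (str, True),
}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def schema_problems(raw):
    """Return a list of ways a JSON string deviates from SCHEMA."""
    data = json.loads(raw)  # JSON mode makes this step reliable, nothing more
    problems = []
    for name, (expected, required) in SCHEMA.items():
        if name not in data:
            if required:
                problems.append(f"missing: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type: {name}")
    for name in data:
        if name not in SCHEMA:
            problems.append(f"unexpected: {name}")
    if isinstance(data.get("due_date"), str) and not ISO_DATE.match(data["due_date"]):
        problems.append("bad date format: due_date")
    return problems

# Valid JSON, wrong in three ways: string amount, renamed field, missing field.
reply = '{"party": "Acme GmbH", "amount": "12000.00", "dueDate": "01/02/2024"}'
problems = schema_problems(reply)
```

The reply parses without error, yet the check reports a wrong type, a missing required field, and an unexpected field name.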

Function calling

You define a function signature with a JSON Schema. The model calls that function with arguments matching the signature.

The enforcement is real but partial. The model is directed to match the schema and does so more reliably than free-form or JSON mode. In my experience it was still possible to get violations on complex nested types, optional fields that were sometimes absent, and enum fields where the model returned something close to but not exactly one of the declared options.

Useful for simple schemas and tool-use patterns. Becomes less reliable at the edges: deep nesting, long lists, complex conditionals.

Structured Outputs

The strictest of the four. The schema is enforced server-side during the model’s token generation. The model is constrained at each step to only produce tokens consistent with the schema. A wrong type is not possible because the tokens that would produce it are never generated. A missing required field cannot happen for the same reason.

The practical consequence: schema violations go from something to validate against to something that does not occur in a completed response. You do not need retry logic for format failures. The output either matches the schema exactly or the call fails in a way you can detect, for example a schema rejected before any output is produced, a refusal, or a response cut off at the token limit.

This matters at scale. Tens of thousands of LLM calls across a full pipeline. In a free-form or JSON mode setup, some fraction produce output that needs special handling. In a Structured Outputs setup, that fraction is zero.

The tradeoffs: the schema must be defined upfront as JSON Schema, passed to the API, and not all schema patterns are supported. Recursive schemas and some complex union types have constraints. For most extraction jobs the schema is straightforward enough that none of this is a problem.

What I ended up using

Structured Outputs for all extraction jobs, with schemas generated from the ontology rather than handwritten. Every extraction job has a corresponding Pydantic model. A codegen step converts those models to the JSON Schema format the API expects. Defining the schema in Python, testing it, and generating the JSON Schema from it is more maintainable than writing JSON Schema by hand.

The seven provenance fields on every extracted fact are mandatory in the schema. The model cannot return an extraction without them. Provenance is what makes version tracking and conflict resolution work, and if it is optional it will be missing in ways that are hard to detect and hard to fix after the fact.

If I were starting over: reach for Structured Outputs earlier. The early pipeline used JSON mode plus validation, and a meaningful fraction of engineering time went into handling format failures. Switching to Structured Outputs made that class of problem disappear.

The thing worth knowing

Free-form with post-processing is right when the output shape is variable and you need flexibility. JSON mode is a reasonable default when you need parseable output without a strict schema. Function calling works well for simple schemas. Structured Outputs is right when the schema is fixed, the output goes to a database, and schema violations at scale are not something you want to handle.

The question to ask: what happens when the output does not match what I expected? If the answer is a silent write of malformed data that later produces wrong query results, the strictest enforcement available is the right starting point.

Four Ways to Run Extraction More Than Once

Abhishek Gawde — Wed, 06 May 2026 17:20:53 GMT

The first time I ran extraction on a real project, I treated it as a single-pass operation: send the document in, get facts out. That works until it does not. Some failure modes are invisible in a single pass, and the only way to catch them is to run something again, differently.

The simplest extraction pipeline looks like this. A document arrives. The pipeline breaks it into chunks. It sends those chunks to a language model with a schema and a prompt. The model returns structured facts. The facts go into storage.

This is fine for documents that are self-contained and well-structured. It starts to break in predictable ways once documents get complicated. Four patterns have emerged in my work, each one designed to fix a specific failure mode.

Pattern 1: Single pass

The document is chunked, retrieved, and extracted once. Each job targets a specific slice of the document. Confidence scores come back per field. High-confidence facts go straight to storage. Low-confidence facts go to a review queue.
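The routing step at the end of a single pass can be sketched in a few lines. The threshold value and the field layout are assumptions for the example, not settings from a real pipeline.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune against reviewed samples

def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split extracted fields into those stored directly and those queued for review."""
    store, review = {}, {}
    for name, field in fields.items():
        target = store if field["confidence"] >= threshold else review
        target[name] = field
    return store, review

extracted = {
    "party": {"value": "Acme GmbH", "confidence": 0.97},
    "due_date": {"value": "2024-03-15", "confidence": 0.55},
}
to_store, to_review = route_fields(extracted)
```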

This is the baseline. Most documents in a legal project corpus are well-structured enough for it to work. Signature blocks have parties in predictable places. Payment schedules have numbers in tables. Dates appear in the operative clauses. For those documents, a second pass would not find anything a first pass missed.

The failure mode is when the relevant clause is not in the retrieved chunks, or when the model cannot make sense of what it sees with no surrounding context. Both look the same from the outside: the fact comes back empty, or with very low confidence. You cannot tell which problem you have without digging.

Pattern 2: Guided second pass

Some documents are not self-contained. A permit might say “in accordance with the noise limits defined in the Environmental Impact Assessment.” An EPC contract might say “as specified in Schedule 4.” The relevant fact is in a different document, and a single pass on the current document cannot find it.

The guided second pass handles this. The first pass extracts what it can. When it cannot resolve a reference, it returns the reference itself rather than an empty field. The pipeline then fetches the referenced document, pulls the relevant chunks from it, and runs a second extraction with both sets of chunks in context.

This is not a retry. It is a different call with richer input, aimed at a specific unresolved piece. The cost is one or two extra LLM calls per unresolved cross-reference.
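The control flow can be sketched as follows. The `extract` and `fetch_chunks` callables are placeholders for pipeline components, the `unresolved_ref` marker is an invented convention, and the fake extractor and the 55 dB figure exist only to make the example runnable.

```python
def extract_with_followup(chunks, extract, fetch_chunks):
    """Run a first pass, then one extra pass per unresolved cross-reference.

    extract(chunks) -> dict of field -> value, where an unresolved field is
        returned as {"unresolved_ref": "<document name>"} instead of empty.
    fetch_chunks(ref) -> list of chunks from the referenced document.
    """
    result = extract(chunks)
    for field, value in list(result.items()):
        if isinstance(value, dict) and "unresolved_ref" in value:
            extra = fetch_chunks(value["unresolved_ref"])
            second = extract(chunks + extra)  # richer input, not a retry
            result[field] = second.get(field, value)
    return result

# Stand-ins for the LLM call and the document store.
def fake_extract(chunks):
    if any("55 dB" in c for c in chunks):
        return {"noise_limit": "55 dB"}
    return {"noise_limit": {"unresolved_ref": "Environmental Impact Assessment"}}

def fake_fetch(ref):
    return ["Noise at the site boundary shall not exceed 55 dB."]

permit = ["Operation shall comply with the noise limits defined in the "
          "Environmental Impact Assessment."]
resolved = extract_with_followup(permit, fake_extract, fake_fetch)
```

If the referenced document has not been ingested, `fetch_chunks` returns nothing useful and the field stays marked as unresolved, which is more informative than an empty value.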

I found this useful specifically for permits and contracts that incorporate external documents by reference. Without it, those cross-references produced empty fields or low-confidence guesses. With it, most of them resolved correctly.

Pattern 3: Full-document escalation

Chunk-and-retrieve has a ceiling. If the chunker splits a clause badly, or retrieval returns the wrong chunks, the model does not see the right text. No amount of reasoning ability compensates for missing input.

One response is to send the whole document when confidence is low. Instead of retrieving a small set of targeted chunks, the pipeline sends the full document text and asks the model to try again.

This does not work for long documents. A 150-page EPC contract can approach or exceed the context limit of some models, and attention over very long inputs degrades in the middle. But it does work for shorter documents: building permits, grid connection agreements, certain regulatory submissions that run to twenty or thirty pages. For those, a full-document pass on a low-confidence result catches things that targeted retrieval missed.

The pattern I settled on: single-pass chunk retrieval first. Escalate to full-document only when a specific field comes back below the quarantine threshold and the document is short enough to fit comfortably in context.
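Expressed as a decision rule, that pattern might look like the sketch below. Both default values are hypothetical placeholders; the right numbers depend on the model and on how confidence is scored.

```python
def should_escalate(confidence, doc_tokens,
                    quarantine_threshold=0.5, max_full_doc_tokens=30_000):
    """Decide whether to retry a field with the full document in context."""
    low_confidence = confidence < quarantine_threshold
    fits_in_context = doc_tokens <= max_full_doc_tokens
    return low_confidence and fits_in_context

# A short permit with a weak result escalates; a long contract does not.
escalate_permit = should_escalate(confidence=0.3, doc_tokens=12_000)
escalate_contract = should_escalate(confidence=0.3, doc_tokens=110_000)
```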

Pattern 4: Consensus across multiple calls

A different response to uncertainty is to ask more than once and compare. Send the same document with the same prompt two or three times, or to different models, and look at where the answers agree. Fields where every call returns the same value get accepted with higher confidence. Fields where calls disagree are flagged for review.
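A minimal sketch of the comparison step, assuming each run returns a flat dict of hashable values. The function name and the sample values are invented.

```python
from collections import Counter

def consensus(results):
    """Compare field values across several extraction runs of the same job.

    Returns (accepted, flagged): fields every run agreed on, and fields where
    runs disagreed, with the competing values and their counts.
    """
    accepted, flagged = {}, {}
    fields = set().union(*(r.keys() for r in results))
    for name in sorted(fields):
        votes = Counter(r.get(name) for r in results)
        if len(votes) == 1:
            accepted[name] = next(iter(votes))
        else:
            flagged[name] = dict(votes)
    return accepted, flagged

runs = [
    {"party": "Acme GmbH", "ld_cap": "10%"},
    {"party": "Acme GmbH", "ld_cap": "10%"},
    {"party": "Acme GmbH", "ld_cap": "15%"},
]
accepted, flagged = consensus(runs)
```

A field missing from one run counts as a disagreement here, because `r.get(name)` returns `None` for that run.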

Luminance, one of the commercial legal AI platforms, uses a version of this called “Panel of Judges”: multiple models vote and an orchestration layer adjudicates when they diverge.

I explored this but did not build it into my pipeline. Two reasons. First, the cost multiplies directly with the number of calls. For a pipeline processing several hundred documents with multiple extraction jobs each, running every job twice adds up fast. Second, agreement between calls is not the same as correctness. When two models share the same blind spot, they can agree on the wrong answer, and there is no signal in the agreement to indicate that.

Where this pattern seems most justified is on high-stakes, low-frequency extractions: an obligation with significant financial consequences, a key commercial term in a contract. Running it twice is a small cost relative to the consequence of getting it wrong.

What I actually found

These four patterns are not alternatives to choose from at the start. They are responses to observed failures.

Single pass is where every pipeline starts. The guided second pass became relevant when cross-document references kept showing up in the failure log. Full-document escalation became useful when short documents produced consistently low-confidence results that better retrieval did not fix. Multi-call consensus is the one I have not yet built, because the cost structure did not feel justified for the volume of documents I was working with.

The question worth asking before adding any pass is: what is actually failing? Empty fields and low-confidence results are symptoms. The cause could be retrieval returning the wrong chunks, chunking splitting the right clause in half, a prompt that is not specific enough, or text that is genuinely ambiguous. Adding a second pass fixes the retrieval and context problems. It does not fix a bad prompt. It does not resolve text that is actually unclear.

In my experience, most of the failure was in retrieval. When the right chunks were present, the model extracted the right facts. When they were absent, a second pass with better context helped. When neither pass worked, the document was usually referencing something that had not been ingested yet, and no extraction strategy could fix that.

The Model That Decides What Gets Found

Abhishek Gawde — Tue, 05 May 2026 17:14:53 GMT

Before I started building a retrieval pipeline for legal documents, I assumed the embedding model was a detail. Pick a good one, move on. What I found was that it is actually the first decision in the pipeline, and it shapes everything that comes after.

Here is the basic idea. An embedding model reads a piece of text and turns it into a list of numbers. That list represents the meaning of the text as a position in space. Two texts that mean similar things get similar numbers. Two texts that mean different things get different numbers.

When the pipeline tries to find the right clause to answer a question, it compares numbers. It does not read the text. It looks at which numbers are closest.

So the question becomes: whose idea of “similar” are we using?

The problem with general-purpose models

A general-purpose embedding model has been trained on enormous amounts of text from across many domains. News, books, web pages, code, conversations. It has a broad notion of what counts as similar.

In legal text, this creates a specific problem. An obligation clause, a definitions clause, and a rights clause all use formal language. They all contain words like “shall”, “party”, “agreement”, “term”. To a model that learned from general text, they look similar. They cluster together in the space.

But they are not the same thing at all. An obligation creates a duty. A right creates a permission. A definition interprets a word. Retrieving the wrong one in response to a question about obligations is a retrieval failure, and no amount of clever prompting downstream can fix it, because the LLM never sees the right clause.

What domain-specific training does

Voyage AI trained voyage-law-2 on around a trillion tokens of legal text, with a specific focus on getting legal distinctions right. The training was designed to push obligation clauses and definition clauses apart in the space, not together.

They benchmarked it against other models on eight legal retrieval tasks from the MTEB evaluation suite. It came out on top in seven of the eight, with a notable margin on three of them: LeCaRDv2, LegalQuAD, and GerDaLIR.

GerDaLIR is a German legal dataset. I kept coming back to that one because the documents I was working with were mainly in German.

A more recent benchmark called MLEB, published in late 2025, tested models specifically across judicial, contractual, and regulatory legal text. It found something worth noting: the models that perform best on general retrieval benchmarks do not necessarily perform best on legal ones. Gemini Embedding ranks first on the broad MTEB benchmark. It ranked seventh on MLEB. Voyage 3.5 ranks twenty-third on MTEB. It ranked third on MLEB. The domain matters.

The constraint that makes this decision sticky

Here is the thing that is easy to overlook. The model used to embed the documents at ingestion time must be the same model used to embed queries at retrieval time. Vector similarity only means something within the same space. Numbers from different models are not comparable.

So switching embedding models later means re-embedding everything. If there are thousands of indexed chunks, that is not a small operation.

This is worth thinking about before the first document is indexed, not after.
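One cheap safeguard is to store the model name alongside the index and refuse queries embedded with anything else. The sketch below is a toy in-memory index, not a real vector store API, and "general-embed-v1" is a made-up model name.

```python
class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.vectors = {}

    def add(self, chunk_id, vector):
        self.vectors[chunk_id] = vector

    def search(self, query_vector, query_model, k=3):
        """Return the k closest chunk ids, or raise if the spaces differ."""
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, "
                f"query embedded with {query_model!r}"
            )

        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        ranked = sorted(
            self.vectors,
            key=lambda cid: dot(query_vector, self.vectors[cid]),
            reverse=True,
        )
        return ranked[:k]

index = VectorIndex("voyage-law-2")
index.add("obligation-7.2", [1.0, 0.0])
index.add("definition-1.4", [0.0, 1.0])
hits = index.search([0.9, 0.1], query_model="voyage-law-2", k=1)
```

Calling `search` with `query_model="general-embed-v1"` raises instead of returning neighbours that look plausible but mean nothing.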

A concrete data point from Harvey

Harvey AI partnered with Voyage to fine-tune an embedding model on their own legal retrieval task, starting from voyage-law-2 as a base. Their custom model reduced the amount of irrelevant material in the top results by nearly 25% compared to the next best off-the-shelf models, and did it at a third of the embedding dimensionality, which helps with storage and retrieval speed.

The path they took was: domain-specific base model first, then fine-tuned on their own data. That progression makes sense. The base model handles the general legal distinctions. The fine-tuning handles the specifics of the domain.

What I have not tested

I have not run a proper comparison on my own document types. The benchmarks cover court cases, legislative text, and general contracts. EPC contracts and building permits are not in any public evaluation set I have found.

Whether the gap that shows up on court cases also shows up on project development documents is genuinely unknown to me. Benchmark results are a reasonable prior but they are not a guarantee.

The test that would actually matter: embed a sample of real documents with both models, run the extraction queries you care about, and look at which chunks come back. That comparison tells you more than any published benchmark.
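A sketch of how that comparison could be scored once a reviewer has marked the correct chunk for each query. The query ids, chunk ids, and rankings are invented; only the metric is the point.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries whose known-relevant chunk appears in the top k.

    results: query -> ranked list of chunk ids from one embedding model
    relevant: query -> id of the chunk a reviewer marked as correct
    """
    hits = sum(
        1 for query, gold in relevant.items()
        if gold in results.get(query, [])[:k]
    )
    return hits / len(relevant)

relevant = {"q1": "c3", "q2": "c7"}
model_a = {"q1": ["c3", "c1", "c2"], "q2": ["c5", "c6", "c7"]}
model_b = {"q1": ["c1", "c3", "c2"], "q2": ["c7", "c5", "c6"]}

rate_a = hit_rate_at_k(model_a, relevant, k=2)
rate_b = hit_rate_at_k(model_b, relevant, k=2)
```

Two queries prove nothing; a sample of real extraction queries per document type is what makes the number worth reading.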

Which Model for Which Job

Abhishek Gawde — Mon, 04 May 2026 17:02:11 GMT

One pattern I kept falling back into was picking a single language model and routing every extraction task through it. It was the easiest setup to reason about. What I started to notice, after running a pipeline for a while, was that the jobs inside it were not actually similar to each other, and treating them as if they were had consequences I had not thought about upfront.

When I first built an extraction pipeline, I picked a model and sent everything to it. Party names, dates, obligations, classifications, commercial terms. All the same model, all the same temperature, all the same prompt shape. It was easier to think about. It also meant I never had to explain to myself why one job might need something different from another.

After a while, something started to look off. The bill was larger than it felt like it should be. A few jobs were clearly doing fine. A few were clearly struggling. The model choice was the same across all of them, so the variance had to be coming from somewhere else. That was what pushed me into thinking about routing.

I am not going to claim this is how pipelines should be built in general. I can only describe what I found when I tried to look at the jobs individually and asked whether they really needed the same thing. This is an exploration, not a recommendation.

What an Extraction Job Actually Looks Like

Before thinking about models, it helped me to be specific about what an extraction job was in my setup. A job took three things: a set of document chunks, a schema describing which fields to pull out, and a prompt describing how to pull them. It returned structured data matching the schema.

The shape of the job turned out to matter for the model choice. A few examples from what I was working on:

Document type classification. Input: the first page or two of a document. Output: one of roughly fifty possible document types in my ontology. The space of answers was small and fixed. The input was short. The decision seemed to lean on surface features like headings, boilerplate, and layout cues. When I looked at the examples where the big model got this right, a smaller model also got them right in spot checks. Not a proof, but enough to make me curious.

Party identification. Input: chunks retrieved from signature blocks and preamble sections. Output: a list of parties with their roles. The pattern was fairly surface-level in the documents I looked at. Parties tended to appear in predictable places, in a limited number of formats. A smaller model seemed to handle it once retrieval had pulled the right chunks. I would not claim this generalises, but for the documents in front of me it held up.

Commercial term extraction. Input: chunks from payment schedules, milestone tables, and LD clauses. Output: structured fields like payment amount, currency, due date, and liquidated damages rate. This was harder. The language was often dense, the numbers were often qualified by conditions, and the fields leaned on each other (a rate might be meaningless without the cap it was bounded by). A mid-tier model seemed to sit in the right place for this. A small model missed qualifiers in the examples I tested. Whether a frontier model would have done better is not something I measured carefully enough to be sure about.

Obligation extraction. Input: chunks from operative clauses. Output: a list of obligations, each with an obligated party, an action, a deadline, a consequence, and a confidence score. This was the hardest job in my pipeline. The model had to distinguish obligations from rights, definitions, and specifications. It had to resolve party references. It had to handle conditional language, carve-outs, and cross-references. This was the one place where a frontier model seemed to earn its cost on my data.

Four jobs, four very different shapes. Running all four through the same model stopped feeling like a neutral choice once I looked at them side by side.
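One way to write down the tentative mapping these four descriptions suggest. The job keys and tier labels are placeholders, and the mapping reflects observations on one data set rather than a general rule.

```python
# Tier per job, following the observations above; names are placeholders.
MODEL_TIER_BY_JOB = {
    "document_type_classification": "small",
    "party_identification": "small",
    "commercial_term_extraction": "mid",
    "obligation_extraction": "frontier",
}

def pick_tier(job_name, default="frontier"):
    """Unknown jobs fall back to the most capable tier until they are tested."""
    return MODEL_TIER_BY_JOB.get(job_name, default)

tier = pick_tier("party_identification")
```

Defaulting untested jobs to the most capable tier keeps quality safe while the cheaper options are being evaluated.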

The Cost Curve

Published pricing for the models I looked at showed a gap of roughly five to fifteen times per token between small and frontier models. That gap may narrow or widen depending on provider, and it is worth checking fresh numbers when you care about the answer.

In my setup, what made the gap more than an abstract figure was the distribution of calls. Classification jobs ran on nearly every document. Party identification ran on most of them. The heavier extraction jobs ran less often and on fewer chunks each time. The cheap-looking jobs were a large share of the call volume, and sending them to a frontier model meant paying frontier rates on work that did not visibly need it.
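The arithmetic behind that observation can be sketched with invented numbers. Every figure below (call counts, tokens per call, and prices in arbitrary units with a tenfold gap) is hypothetical and chosen only to show how call volume interacts with price.

```python
def run_cost(jobs, price_per_1k_tokens):
    """Total cost of a run: calls x tokens per call x price for the job's tier."""
    total = 0.0
    for job in jobs:
        tokens = job["calls"] * job["tokens_per_call"]
        total += tokens / 1000 * price_per_1k_tokens[job["tier"]]
    return total

PRICES = {"small": 1.0, "frontier": 10.0}  # arbitrary units, 10x gap

def make_jobs(cheap_tier):
    """Same workload, with the two high-volume jobs on `cheap_tier`."""
    return [
        {"name": "classification", "calls": 1000, "tokens_per_call": 1000, "tier": cheap_tier},
        {"name": "party_identification", "calls": 800, "tokens_per_call": 1500, "tier": cheap_tier},
        {"name": "obligation_extraction", "calls": 200, "tokens_per_call": 4000, "tier": "frontier"},
    ]

all_frontier = run_cost(make_jobs("frontier"), PRICES)  # 30000.0 units
routed = run_cost(make_jobs("small"), PRICES)           # 10200.0 units
```

In this made-up workload the two simple jobs account for most of the tokens, so moving them down a tier removes roughly two thirds of the cost. A different job distribution would give a different answer.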

I want to be careful here. I did not do a rigorous cost study. I did a back-of-the-envelope estimate after watching the bill grow, and swapping the classification and party jobs to a smaller model on a test project appeared to cut the run cost meaningfully. “Meaningfully” is doing work in that sentence because I am not going to put a number on it that I cannot defend across other setups. Your distribution of jobs and documents will almost certainly look different from mine.

The point is not the specific saving. The point is that when jobs have different shapes and different frequencies, one model across all of them is a decision, not a default, and it is worth checking whether the decision is the right one.

Where Small Models Seemed to Break

It would have been convenient if a small model was a drop-in replacement for the simpler jobs. On my data, it mostly was for classification and party identification. It was not, reliably, for anything more subtle than that.

The pattern I noticed, roughly, was that the small model I tested was comfortable with surface tasks and less comfortable when the task required holding multiple constraints in mind. Three cases where I saw this most clearly:

Cross-referencing within the input. Identifying a party on a signature page and then linking obligations in the body of the document to that party. The small model extracted both pieces correctly in many cases but failed to connect them. The larger model tracked the reference more often. I am describing what I saw on my test set, not a measured difference.

Distinguishing surface-similar categories. Obligations look like rights. Specifications look like obligations. Definitions look like operative clauses. Trigger words like “shall” appear across all of them. The smaller model leaned on those trigger words and over-extracted. The larger model seemed to read more of the clause structure before deciding. Again, this is a pattern I noticed, not a benchmark.

Handling conditions and carve-outs. An obligation that applies “except where the contractor has given notice under clause 14.3” behaves differently from one that applies unconditionally. The small model sometimes dropped the carve-out. The larger model preserved it more often. This one bit me in a specific way because dropped carve-outs looked like confidently extracted obligations, which is worse than a skipped one.

I hesitate to turn these into a general rule about small versus large models. What I can say is that on my data, for my jobs, these were the failure modes I saw, and they informed where I pushed jobs up or down the tier.

The One-Model Default

There is a real pull towards picking one model and staying with it. Swapping models per job means more testing, more prompt variation, more things to keep working. If a single model is acceptable across all jobs, the engineering budget is better spent on prompts, schemas, and retrieval.

The cost I did not notice at first was that a one-model pipeline hides the cases where a cheaper model would have been fine. Every job is running on something that is at least good enough, so the question of whether something cheaper would also have been good enough does not come up naturally. The answer is only visible if you go looking for it.

I am not saying routing is always worth it. For small pipelines, the operational overhead might exceed any saving. For pipelines where every job is genuinely hard, there may not be much to route towards. The judgement call is whether the jobs are varied enough that the savings and the quality differences are worth the additional complexity. In my case they were. In someone else’s they might not be.

A Process I Found Useful

What I ended up doing, roughly, was this. I did not invent it and I would not claim it is the right approach in general. It worked for me in the sense that it produced evidence I did not have before.

Pick a capable model and run each job against it. Treat that as a reference point for quality, not an absolute ceiling.

For each job, try a smaller model. Compare outputs against the reference on a held-out sample. If the gap seems small, note that and consider moving the job down. If the gap is clear, leave the job where it is.

Revisit the exercise every few months. Small models have been getting better quickly enough that assumptions from last year are worth re-testing. Jobs that needed a frontier model once may not need one now.

Two things to be honest about in this process. Comparing outputs against a reference is not the same as comparing outputs against ground truth. A cheaper model might agree with the reference while both are wrong. Some human-labelled examples make the comparison sharper but they take real effort to build. And “seems small” is a judgement, not a measurement. Pairing the eyeball comparison with a few concrete metrics (exact match on key fields, LLM-judge scores on free-text fields) helped me feel less like I was picking favourites.
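As a rough illustration of the "exact match on key fields" part, here is a minimal sketch. The function name, the paired-list input shape, and the field names in the usage note are assumptions for illustration, not the pipeline's actual code. It measures agreement with the reference model, not correctness against ground truth.

```python
from typing import Mapping, Sequence


def field_agreement(
    reference: Sequence[Mapping[str, object]],
    candidate: Sequence[Mapping[str, object]],
    key_fields: Sequence[str],
) -> dict[str, float]:
    """Per-field exact-match rate between paired outputs of two models.

    reference[i] and candidate[i] are assumed to be the outputs of the
    reference model and the smaller model for the same held-out input.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must be paired one-to-one")
    rates: dict[str, float] = {}
    for field in key_fields:
        matches = sum(
            1
            for ref, cand in zip(reference, candidate)
            if ref.get(field) == cand.get(field)
        )
        rates[field] = matches / len(reference) if reference else 0.0
    return rates
```

Calling field_agreement(reference_outputs, small_model_outputs, ["document_type", "responsible_party"]) gives one rate per field, which makes "the gap seems small" a little easier to argue about.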

What Settled For Me

The routing pattern I ended up with was: a cheap model for classification and surface-level extraction, a mid-tier model for structured extraction with internal dependencies, and a frontier model only for obligation extraction and other jobs that needed genuine reasoning about document structure. Schema enforcement at the API level using structured outputs helped, because it removed a class of validation errors that would otherwise have made the cheaper models look worse than they actually were.

This is what happened to work on my data, for my jobs, with the models available to me at the time I looked. A different document domain, a different ontology, or a different set of models might land somewhere else. I would be surprised if the underlying observation did not apply somewhere, which is roughly: if your jobs are not the same, it is probably worth checking whether your models should be. But I would be wary of anyone (including me) claiming the specific routing is the right answer for their setup.

The one thing I feel reasonably confident about is that treating model selection as something to ask per job, rather than something to decide once for the whole pipeline, gave me more information about what was actually going on. Whether that information leads to routing, or leads back to a single model with more confidence than before, is a separate question.

I write about building document intelligence systems: the architecture, the design decisions, and the things that do not work the way I expected. If this was useful, follow along for the next piece.

One Annotated Document Is Not Enough (But It Is a Start)

Abhishek Gawde — Sat, 25 Apr 2026 17:10:49 GMT

Here is the awkward truth about this pipeline: the accuracy numbers I have been quoting throughout this series come from a single annotated document. One 17-page German building permit. 61 ground truth obligations, hand-labelled by a human reviewer.

93% recall and 71% precision sound good. But those numbers describe how well the pipeline performs on one document from one jurisdiction in one language. Whether the same numbers hold for an English EPC contract, an Italian environmental permit, or a Spanish land lease is an open question. I do not have annotated ground truth for any of those yet.

This article is about the eval framework that makes those numbers possible, and about the gap between having a framework and having enough data to trust it.

The framework

The eval system is auto-discovering. To add ground truth for a new document, you create a folder with two files:

A config.json that identifies the document:

{
  "document_id": "wachow-baugenehmigung-sep2023",
  "document_type": "PERMIT",
  "language": "de",
  "source_file": "Baugenehmigung_Sep2023.pdf"
}

And an obligations.json that lists every obligation a human found in the document, with the same field structure the pipeline produces:

[
  {
    "description": "Construction must begin within 3 years of permit issuance",
    "clause_reference": "NB 1.1",
    "source_page": 3,
    "responsible_party": "Developer",
    "obligation_category": "REGULATORY",
    "severity": "CRITICAL"
  }
]

Drop those two files into a folder under tests/eval/, and the scorer finds them automatically. No code changes. No test registration. The eval runner discovers every folder, runs the pipeline against the source document, and compares extracted obligations against ground truth.
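A minimal sketch of what the discovery step could look like, assuming the folder layout described above. The function name and the returned dictionary shape are hypothetical. Folders missing either file are skipped rather than treated as errors, which is also an assumption.

```python
import json
from pathlib import Path


def discover_eval_cases(root: Path) -> list[dict]:
    """Find every sub-folder of `root` that holds both ground truth files."""
    cases = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        config_path = folder / "config.json"
        truth_path = folder / "obligations.json"
        # A folder only counts as an eval case when both files are present.
        if not (config_path.is_file() and truth_path.is_file()):
            continue
        cases.append(
            {
                "folder": folder,
                "config": json.loads(config_path.read_text(encoding="utf-8")),
                "ground_truth": json.loads(truth_path.read_text(encoding="utf-8")),
            }
        )
    return cases
```

With this shape, discover_eval_cases(Path("tests/eval")) returns one entry per annotated document, and adding a new document really is just adding a folder.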

How scoring works

The scorer computes three metrics per document:

Recall: of the obligations in ground truth, how many did the pipeline find? A ground truth obligation counts as “found” if any extracted obligation matches it on normalised clause reference and has sufficient description overlap. The overlap threshold is generous because the pipeline’s phrasing will never exactly match the human’s.

Precision: of the obligations the pipeline extracted, how many match something in ground truth? This is the harder metric because it penalises both genuine false positives and granularity splits where the pipeline found a real obligation but split it into two entries.

F1: the harmonic mean of the two. A single number that balances recall and precision.
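The three metrics can be sketched as below. This is an illustration under stated assumptions, not the actual scorer: the reference normaliser, the overlap measure, the 0.3 threshold, and the greedy one-to-one matching are all choices made for the example. One-to-one matching is used so that a granularity split (two extractions for one ground truth item) costs precision, as described above.

```python
import re


def normalise_ref(ref: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", ref.lower())


def token_overlap(a: str, b: str) -> float:
    """Shared words divided by the size of the smaller word set."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def is_match(truth: dict, extracted: dict, min_overlap: float) -> bool:
    same_ref = normalise_ref(truth["clause_reference"]) == normalise_ref(
        extracted["clause_reference"]
    )
    return same_ref and token_overlap(
        truth["description"], extracted["description"]
    ) >= min_overlap


def score(ground_truth: list[dict], extracted: list[dict], min_overlap: float = 0.3) -> dict:
    claimed: set[int] = set()  # ground truth items already matched
    pairs = 0
    for item in extracted:
        for i, truth in enumerate(ground_truth):
            if i not in claimed and is_match(truth, item, min_overlap):
                claimed.add(i)
                pairs += 1
                break
    recall = pairs / len(ground_truth) if ground_truth else 0.0
    precision = pairs / len(extracted) if extracted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```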

For fields where exact string matching does not work, like obligation descriptions where two valid phrasings can describe the same obligation, the framework uses LLM-as-judge. It sends the extracted description and the ground truth description to the model and asks “are these semantically equivalent?” This is not used for numeric fields (dates, amounts) where exact match is the right test.

What one document tells you

Even with a single annotated document, the eval framework is useful. It catches regressions. Every prompt change, every schema change, every model upgrade gets run against the one document we have. If recall drops from 93% to 85% after a prompt change, we know immediately.

It also establishes a baseline. Before the eval framework existed, accuracy was assessed by eyeballing. “This looks about right” is not a measurement. 56 out of 61 is a measurement. The gap between those two approaches is large.

And it creates a feedback loop. When the pipeline misses one of the 61 ground truth obligations, we can look at why. Wrong clause reference format that the normaliser did not handle? Obligation split across a page boundary that fell outside the batch overlap? Model just missed it? Each failure mode points to a specific fix.

What one document does not tell you

Generalisability. The building permit is a specific document type from a specific legal tradition. German regulatory permits use a predictable structure: numbered Nebenbestimmungen, each one a self-contained condition. The pipeline was tuned against this structure.

EPC contracts are different. Obligations are scattered through operative clauses, schedules, and annexes. They reference each other. They use defined terms. A single obligation might span three paragraphs and reference two schedules. Whether the pipeline handles this well is a question I cannot answer with confidence until I have annotated EPC ground truth.

Language is another axis. The conceptual prompt is language-agnostic by design, but the few-shot examples are in English. The confidence scorer’s quote relevance signal depends on word overlap between the English description and the original-language quote, which works differently for German (many shared technical terms) than for, say, Japanese (almost no shared terms).

One document gives you a baseline. Ten documents across three types and two languages would give you a system you can actually trust. I have one.

Growing the ground truth

The strategy for getting from one to ten is built into the review workflow. When a human reviewer approves an obligation in the review queue, that approved extraction automatically becomes a candidate for the ground truth dataset. The reviewer has already verified the description, the clause reference, the source quote. That is annotation labour that would otherwise be thrown away.

This is not free. The reviewer is approving individual obligations, not annotating a complete document. An approved set might have 50 obligations when the document actually contains 55. The missing 5 are obligations the pipeline did not extract and the reviewer never saw. So the auto-generated ground truth has high precision (everything in it is correct) but unknown recall (it might be missing things).

To close this gap, periodic completeness audits are needed. Take a document where the pipeline has been running for a while, sit down with the full text, and check whether the approved obligations cover everything. Those audits are expensive in human time, which is why there is only one fully annotated document so far.

The honest state of things

The eval framework is ready. The tooling works. Adding a new document to the test suite takes five minutes. The scorer runs on every change and blocks merges if metrics drop below threshold.

The bottleneck is annotation. Every additional annotated document makes the pipeline more trustworthy. Not just because it tests a new case, but because it reveals failure modes that were invisible with a single document. The first EPC contract ground truth will almost certainly surface precision problems that the building permit does not, because the document structure is so different.

Until that happens, the accuracy numbers in this series come with an asterisk. They are real measurements against real ground truth. They are also measurements against a single data point. Both things are true.

Part 8 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Part 3: [A 106-year-old legal framework]. Part 4: [15 lines of rules]. Part 5: [The cheapest duplicate]. Part 6: [Confidence scores are not probabilities]. Part 7: [Five checks that cost nothing].

Five Checks That Cost Nothing and Catch What the Model Missed

Abhishek Gawde — Fri, 24 Apr 2026 16:10:55 GMT

A rule-based QA layer that runs after extraction and before anyone sees the results.

By the time an obligation reaches the review queue, it has been through a lot: batch extraction, clause type classification, confidence scoring. But none of those steps check whether the extraction is internally consistent. They check whether the model found something and how confident it is. They do not check whether what it found makes sense.

The QA layer does. It runs five rule-based checks on every extracted obligation, flags the ones that fail, and costs nothing because there is no LLM call involved. Just string comparisons, pattern matching, and field validation.

Check 1: Normalisation

The extraction schema has fields with expected formats. Clause references should look like “12.3(b)” or “Nebenbestimmung 4.1”, not “see above” or “various”. Dates should parse as actual dates, not “soon” or “TBD”. Responsible parties should be legal entity names, not descriptions like “the relevant authority”.

The normalisation check does not reject malformed values. It flags them. A clause reference of “see above” is not wrong in the sense that the model invented it. The document probably says “see above”. But it is not useful as a structured field, and the reviewer should know.

The implementation is pattern matching. Does the clause reference contain at least one digit? Does the date field parse against a set of known date formats (ISO, German DD.MM.YYYY, written-out month)? Is the responsible party longer than two characters and not on a blocklist of generic terms?
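A minimal sketch of those three questions as code. The flag names, the date format list, and the blocklist entries are assumptions for illustration. Written-out month names are parsed with the default English locale here, so German month names would need extra handling.

```python
import re
from datetime import datetime

# Assumed formats: ISO, German DD.MM.YYYY, and two written-out English forms.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y", "%B %d, %Y")

# Hypothetical blocklist of generic party descriptions.
GENERIC_PARTIES = {"the relevant authority", "the authority", "the parties", "various"}


def parses_as_date(value: str) -> bool:
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def normalisation_flags(obligation: dict) -> list[str]:
    """Return flags for malformed fields. Nothing is rejected or altered."""
    flags = []
    ref = obligation.get("clause_reference") or ""
    if ref and not re.search(r"\d", ref):
        flags.append("clause_reference_no_digit")
    deadline = obligation.get("deadline")
    if deadline and not parses_as_date(deadline):
        flags.append("deadline_unparseable")
    party = (obligation.get("responsible_party") or "").strip()
    if party and (len(party) <= 2 or party.lower() in GENERIC_PARTIES):
        flags.append("responsible_party_generic")
    return flags
```

Empty fields are deliberately not flagged here, since missing values are the job of a separate check.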

Check 2: Garbage detection

Sometimes the model produces extractions that are structurally valid but semantically empty. A description that is just the clause reference repeated. A source quote that is a single word. An obligation where every field is filled but the description is “See clause 12.3(b) for details.”

Garbage detection looks for these patterns. Description shorter than 20 characters. Source quote shorter than 10 characters. Description that is more than 80% identical to the clause reference. These are not useful extractions, and surfacing them in the review queue wastes reviewer time.
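The three thresholds translate fairly directly into code. In this sketch the reason codes and the source_quote field name are assumptions, and "80% identical" is approximated with difflib's similarity ratio, which may differ from how the real check measures it.

```python
from difflib import SequenceMatcher


def garbage_reasons(
    obligation: dict,
    min_description: int = 20,
    min_quote: int = 10,
    max_ref_similarity: float = 0.8,
) -> list[str]:
    """Return reason codes for structurally valid but empty extractions."""
    reasons = []
    description = (obligation.get("description") or "").strip()
    quote = (obligation.get("source_quote") or "").strip()
    ref = (obligation.get("clause_reference") or "").strip()

    if len(description) < min_description:
        reasons.append("description_too_short")
    if len(quote) < min_quote:
        reasons.append("quote_too_short")
    if description and ref:
        similarity = SequenceMatcher(None, description.lower(), ref.lower()).ratio()
        if similarity > max_ref_similarity:
            reasons.append("description_repeats_reference")
    return reasons
```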

Flagged items are not deleted. They are marked as QA failures with a reason code, which the reviewer can see. Sometimes the flag is wrong and the extraction is fine. But more often, a garbage flag points to a real problem: the model found a clause header but could not extract the substance.

Check 3: Missing required fields

The extraction schema defines which fields are required for an obligation to be actionable. At minimum: a description, a clause reference, and a source page number. Without these three, the obligation cannot be traced back to the document, which defeats the purpose.

This check is a simple null/empty test on the required fields. If any required field is missing, the extraction is flagged. The confidence scorer penalises missing fields through the field completeness signal, but the QA check is blunter: it says “this extraction is incomplete” regardless of the overall score.

The two mechanisms are complementary. The confidence scorer produces a continuous score that feeds into triage. The QA check produces a binary flag that says “something is structurally wrong here.”
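The binary check itself is only a few lines. A sketch, using the three required fields named above; the function name is hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
REQUIRED_FIELDS = ("description", "clause_reference", "source_page")


def missing_required(obligation: dict) -> list[str]:
    """List required fields that are absent, None, or blank strings."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = obligation.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing
```

An extraction is flagged whenever the returned list is non-empty, regardless of its confidence score.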

Check 4: Duplicate detection

The batch extraction step deduplicates within a single document using the normalised clause reference key. But the QA layer runs across the full extraction set for a project, catching duplicates that the per-document dedup misses.

The most common case: two documents reference the same obligation with slightly different clause numbering. The EPC contract calls it “clause 12.3” and the compliance annex calls it “condition 12.3”. The per-document dedup does not catch this because they come from different documents. The QA duplicate check compares descriptions across the full set using a simple token overlap threshold.
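A sketch of the cross-document comparison, assuming each obligation carries a document_id and a description. The Jaccard measure and the 0.6 threshold are placeholders, since the overlap measure and threshold are not specified beyond "simple token overlap". The pairwise loop is quadratic, which is acceptable at a few hundred obligations.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cross_document_duplicates(
    obligations: list[dict], threshold: float = 0.6
) -> list[tuple[int, int]]:
    """Return index pairs of likely duplicates that come from different documents."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(obligations), 2):
        # Same-document pairs are left to the per-document dedup.
        if a["document_id"] == b["document_id"]:
            continue
        if jaccard(a["description"], b["description"]) >= threshold:
            flagged.append((i, j))
    return flagged
```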

This is a lighter version of the full reconciliation step. Reconciliation embeds everything and runs clustering. The QA duplicate check just flags obvious overlaps so the reviewer is aware before reconciliation runs. Think of it as an early warning rather than a resolution.

Check 5: Grounding

This is the check that catches the most interesting failures. It asks: does the source quote actually support the obligation description?

The model sometimes produces a perfectly reasonable obligation description and a source quote that is from the right document and the right page, but the two do not match. The description says “the contractor shall complete commissioning by June 30” and the source quote is about site access hours. The model connected two things from the same page that do not belong together.

The grounding check measures word overlap between the description and the source quote, similar to the quote relevance signal in the confidence scorer but with a hard threshold rather than a continuous weight. Below the threshold, the extraction is flagged as poorly grounded.
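In code, the hard-threshold version could look like this. It is a sketch: the share-of-description-words measure and the 0.2 threshold are assumptions, and no stop-word filtering is applied.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def grounding_overlap(description: str, quote: str) -> float:
    """Share of description words that also appear in the source quote."""
    d, q = tokens(description), tokens(quote)
    if not d or not q:
        return 0.0
    return len(d & q) / len(d)


def is_poorly_grounded(description: str, quote: str, threshold: float = 0.2) -> bool:
    # Hard cut-off: unlike the confidence scorer, nothing else can offset this.
    return grounding_overlap(description, quote) < threshold
```

For the commissioning example above, a quote about site access hours shares no words with the description, so the extraction is flagged however strong its other fields are.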

This overlaps with the confidence scorer’s quote relevance signal. The difference is that the confidence scorer folds it into an overall score, where a bad grounding signal can be offset by strong signals elsewhere. The QA check treats it as a standalone flag. An extraction can have high confidence overall but still fail the grounding check if the quote and description are mismatched.

Why a separate layer

The natural question is: why not fold all of this into the confidence scorer? The scorer already penalises missing fields and low quote relevance. Why have a separate QA step?

Two reasons.

First, the scorer produces a number. The QA layer produces reasons. A confidence score of 0.72 tells the reviewer “this is borderline.” A QA flag that says “garbage: description is 12 characters” or “grounding: quote does not match description” tells the reviewer what is wrong. The flag is actionable in a way the number is not.

Second, the QA checks are hard filters that should not be smoothed over by other signals. An extraction with a missing clause reference is incomplete regardless of how good the description is. The scorer might give it 0.78 if everything else is strong. The QA layer flags it as structurally incomplete. Both are true. The reviewer sees both.

What it looks like in practice

Across the 9-document Wachow project, the QA layer flagged about 8% of extractions. Most flags were normalisation issues (clause references like “see section above”) and a handful of grounding failures. The garbage check caught 3 extractions that were essentially empty. The missing field check caught 5 extractions without clause references.

None of these were showstoppers on their own. But removing them from the default review queue view means reviewers spend their time on real obligations rather than sorting through incomplete or malformed items.

The total runtime for all five checks across 520 obligations: under a second. No API calls. No embeddings. Pattern matching and string comparison.

Part 7 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Part 3: [A 106-year-old legal framework]. Part 4: [15 lines of rules]. Part 5: [The cheapest duplicate]. Part 6: [Confidence scores are not probabilities].

Confidence Scores Are Not Probabilities (and Why That Matters)

Abhishek Gawde — Wed, 15 Apr 2026 15:58:44 GMT

Five observable signals, no LLM call. The caveat at the bottom matters: these are ranking scores, not probabilities.

By this point in the series, the pipeline extracts obligations, classifies them by type, reconciles duplicates, and detects document versions. But one problem keeps showing up at every stage: how much should you trust any individual extraction?

The model says it found an obligation in clause 12.3(b) requiring the contractor to complete commissioning by June 30. Is that real? Is the clause reference correct? Did the model hallucinate the deadline? Is the source quote actually from the document, or did the model paraphrase so aggressively that the connection to the original text is gone?

You could send every extraction back to the LLM and ask “how confident are you?” But self-reported model confidence is not very useful on its own. Models tend to be confidently wrong in exactly the cases where you need them to be uncertain.

Instead, I built a scorer that looks at observable evidence in the extraction itself. No LLM call required. It checks whether the extraction carries the signals you would expect from a real, well-grounded obligation.

Five signals

Each signal produces a value between 0.0 and 1.0, weighted and combined into an overall confidence score.

Clause reference present (weight: 0.15). Does the extraction include a clause reference? If the model says “this obligation comes from clause 12.3(b)”, that is a verifiable claim. If the model just says “the contractor must do X” with no clause reference, the extraction is harder to verify and more likely to be a hallucination. Signal: 1.0 if present, 0.0 if missing.

Source quote quality (weight: 0.25). The extraction schema asks the model to include an original-language source quote from the document. This signal checks whether the quote exists and whether it is long enough to be meaningful. A one-word quote is not useful. A full sentence from the original German or English text is strong evidence that the model actually found something specific in the document. Signal: 0.0 for missing, 0.5 for short quotes, 1.0 for substantive quotes.

Quote relevance (weight: 0.25). This turned out to be the most useful signal. It measures word overlap between the English-language obligation description and the original-language source quote. High overlap means the obligation description is clearly grounded in specific document text. Low overlap suggests the model may be paraphrasing too aggressively or generating a description that is not well-connected to the source material.

The implementation counts how many words appear in both the description and the quote. More shared words means the description is grounded in the source text. Fewer shared words means the model may have drifted. This works across languages because party names, dates, amounts, and clause numbers tend to appear in both the original text and the English description.

Field completeness (weight: 0.15). The extraction schema has required and optional fields. Responsible party, deadline, obligation category, severity. An extraction that fills most fields is more likely to be a well-understood obligation. An extraction with just a description and nothing else is more likely to be vague or poorly grounded. Signal: ratio of filled fields to total fields.

Clause type classification (weight: 0.20). This reuses the clause type assigned by the classifier from article 3 in this series. If the model classified the item as OBLIGATION, it gets the full weight. BOILERPLATE gets half. Anything else (RIGHT, INFORMATIONAL, DEFINITION) gets zero. This signal alone often pushes non-obligations below the quarantine threshold.
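Put together, the overall score is a weighted sum of the five signals. The weights and the clause type mapping below are the ones listed above; the dictionary keys and function names are made up for this sketch, and the individual signal values are taken as already computed.

```python
# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "clause_reference": 0.15,
    "quote_quality": 0.25,
    "quote_relevance": 0.25,
    "field_completeness": 0.15,
    "clause_type": 0.20,
}

# OBLIGATION gets the full signal, BOILERPLATE half, everything else zero.
CLAUSE_TYPE_SIGNAL = {"OBLIGATION": 1.0, "BOILERPLATE": 0.5}


def clause_type_signal(clause_type: str) -> float:
    return CLAUSE_TYPE_SIGNAL.get(clause_type, 0.0)


def confidence(signals: dict[str, float]) -> float:
    """Weighted sum of the five signals, each expected in [0.0, 1.0]."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```

One consequence is easy to read off the weights: an item classified as DEFINITION with every other signal at 1.0 tops out at 0.80, so it can never reach the auto-accept tier.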

What these are not

I want to be explicit about this: these confidence scores are not calibrated probabilities. A score of 0.85 does not mean there is an 85% chance the extraction is correct. The scores are not trained against ground truth outcomes. They are not validated for statistical calibration.

What they are is a ranking function. Higher scores correlate with better extractions. Lower scores correlate with worse ones. The thresholds (0.90 for auto-accept, 0.70 for quarantine) were set by manual inspection of score distributions across the first batch of extractions, not by any formal calibration process.

This distinction matters because it is tempting to present confidence scores as precise measurements. “This obligation was extracted with 87% confidence” sounds scientific and trustworthy. But if the scoring function is a weighted sum of heuristic signals, 87% is not a probability. It is a position on a ranking scale.

Being honest about this does not make the scores less useful. It makes them more trustworthy, because the people using them understand what they are actually looking at.

The triage model

The scores feed into a three-tier triage:

Above 0.90: auto-accept. Written to the knowledge graph immediately. The extraction has a clause reference, a substantive source quote, good quote relevance, most fields filled, and is classified as an obligation. This combination rarely produces false positives.

Between 0.70 and 0.90: flag for review. Written to the knowledge graph but marked as unverified. Appears in the review queue. The extraction is probably fine but something is weak: maybe the quote is short, or the clause reference is missing, or field completeness is low. A human glances at it and approves or rejects.

Below 0.70: quarantine. Not written to the knowledge graph. Held in the review queue only. The extraction has multiple weak signals. Maybe it is classified as a RIGHT rather than an OBLIGATION. Maybe the quote relevance is very low, suggesting the description does not match the source text. Mandatory human review before it goes anywhere.
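The tiers reduce to two comparisons. A sketch, with the thresholds as parameters; the tier labels are hypothetical, and treating a score of exactly 0.90 or exactly 0.70 as "review" is an assumption about the boundaries.

```python
def triage(score: float, auto_accept: float = 0.90, quarantine: float = 0.70) -> str:
    """Map a confidence score to one of the three tiers."""
    if score > auto_accept:
        return "auto_accept"
    if score >= quarantine:
        return "review"
    return "quarantine"
```

Because the thresholds are plain arguments, trying a different cut-off is just a re-run over the existing scores: triage(0.68) is quarantined, while triage(0.68, quarantine=0.65) lands in review.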

The quarantine tier is the important one. It is where the system says “I found something, but I do not trust it enough to act on it.” That honesty is what makes the pipeline usable for legal documents, where a confidently wrong extraction is worse than an uncertain one.

What the scores catch in practice

The most common pattern the scorer catches: the model extracts something that sounds like an obligation but has low quote relevance. The English description says “the contractor shall maintain the access road in good condition” but the German source quote is about something else entirely, maybe a general reference to site maintenance from a different section. The model connected two things that do not belong together. The quote relevance signal catches this because the word overlap between description and quote is low.

The second most common: missing clause reference combined with low field completeness. These tend to be vague extractions where the model identified a general theme (“the parties shall cooperate in good faith”) but cannot point to a specific clause. Often real provisions, but too vague to be actionable as tracked obligations.

The clause type signal is the bluntest instrument but the most reliable. If the model itself classified something as INFORMATIONAL or DEFINITION, losing 0.20 from the score almost always pushes it into quarantine. The model’s own classification, combined with the confidence penalty, creates a self-correcting loop.

Why this works as a free layer

The entire confidence scorer runs without any LLM call. It operates on data that already exists in the extraction output: the clause reference, the source quote, the description, the filled fields, the clause type. Computing all five signals for 520 obligations takes less than a second.

This means you can re-run it after changing thresholds without any cost. Wondering what happens if you lower the quarantine threshold from 0.70 to 0.65? Re-score, check the distribution, decide. No tokens spent. No waiting for API calls.

It also means the scorer is deterministic. Same extraction, same score, every time. This matters for the review workflow: a reviewer can trust that the score they see today is the same score they would see tomorrow.

What it does not catch

The scorer is good at catching poorly-evidenced extractions. It is not good at catching well-evidenced but wrong extractions. If the model produces a beautifully formatted obligation with a real clause reference, a relevant source quote, all fields filled, but gets the responsible party wrong, the scorer will give it a high confidence score.

Catching factual errors in well-formed extractions requires either ground truth comparison (which needs labelled data) or a separate verification step (which costs tokens). The confidence scorer is a first filter, not a guarantee.

For now, the combination of permissive extraction, clause type classification, confidence scoring, and human review for flagged items catches enough errors to be usable. Each layer catches a different kind of problem. None of them is sufficient alone.

Part 6 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Part 3: [A 106-year-old legal framework]. Part 4: [15 lines of rules]. Part 5: [The cheapest duplicate].

The Cheapest Duplicate Is the One You Never Create

Abhishek Gawde — Tue, 14 Apr 2026 15:57:11 GMT

Three layers of version detection, all running before extraction, preventing 25% of duplicates at source.

The reconciliation layer from the last article reduced 520 obligations to 448 by merging duplicates after extraction. But roughly 121 of those 520 should never have existed in the first place.

The Wachow project had a February B-Plan and a September B-Plan. Same document, different versions. The pipeline processed both and extracted the same obligations twice. Reconciliation caught some of these, but not all, because the two versions sometimes had slightly different wording for the same condition.

The fix was not better reconciliation. It was detecting the version relationship before extraction and letting the user supersede the older document.

Three layers of detection

Document versions announce themselves in different ways. Some are identical files re-uploaded to a different folder. Some share a name but differ by a date or version number. Some explicitly reference their parent document. Each needs a different detection method.

Layer 1: content hashing. SharePoint provides a quickXorHash for every file. Identical files, regardless of filename or folder, produce the same hash. This catches the easiest case: someone downloads a PDF, renames it, and uploads it to a different project folder. Detection confidence: 100%. Cost: a string comparison.
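A sketch of the grouping step, assuming the file listing has already been fetched and each entry carries its path and hash. The dictionary keys path and quick_xor_hash are hypothetical names for this example, not fields from any API response.

```python
from collections import defaultdict


def group_identical_files(files: list[dict]) -> list[list[str]]:
    """Group file paths that share a content hash, keeping groups of two or more."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for entry in files:
        by_hash[entry["quick_xor_hash"]].append(entry["path"])
    return [paths for paths in by_hash.values() if len(paths) > 1]
```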

Layer 2: filename normalization. This is where most of the value comes from. The idea is simple: strip away the parts of a filename that change between versions and compare what remains.

The normalization pipeline:

Strip the file extension
Lowercase everything
Remove version markers: _v3, -Rev.02, _final, _FINAL_v2
Remove date patterns: ISO dates, German DD.MM.YYYY, compact YYYYMMDD, month-year combinations
Replace all separators (underscores, hyphens, dots) with spaces
Collapse whitespace

After normalization:

"EPC-Contract_2024-03-15_v2.pdf"  becomes  "epc contract"
"EPC-Contract_2024-09-01_v3.pdf"  becomes  "epc contract"
"B-Plan_Feb_2023.pdf"             becomes  "b plan"
"B-Plan_Sep_2023.pdf"             becomes  "b plan"

Files with identical normalized names get grouped as version candidates with 0.85 confidence. Files that do not match exactly but share 75%+ word overlap (Jaccard similarity on word tokens) get grouped at 0.70 confidence.
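The normalization steps and the two confidence tiers can be sketched as follows. The regular expressions are illustrative guesses rather than the production patterns, and the month list only covers English-style abbreviations. The sketch follows the marker list above, so "final" is treated as a version marker; drop it from VERSION_RE if it should be kept as a status word.

```python
import re
from pathlib import PurePath

MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"

# Version markers such as _v3, -rev.02, _final (input is already lowercased).
VERSION_RE = re.compile(r"(?<![a-z0-9])(?:v\d+|rev\.?\s?\d+|final)(?![a-z0-9])")

DATE_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}"                        # ISO 2024-03-15
    r"|\d{2}\.\d{2}\.\d{4}"                     # German 15.03.2024
    r"|(?<!\d)\d{8}(?!\d)"                      # compact 20240315
    r"|(?<![a-z])" + MONTH + r"[-_ .]?\d{4}"    # feb_2023, march2024
)


def normalise_filename(filename: str) -> str:
    name = PurePath(filename).stem.lower()   # strip extension, lowercase
    name = VERSION_RE.sub(" ", name)         # remove version markers
    name = DATE_RE.sub(" ", name)            # remove date patterns
    name = re.sub(r"[_\-.]+", " ", name)     # separators to spaces
    return " ".join(name.split())            # collapse whitespace


def version_confidence(a: str, b: str) -> float:
    """0.85 for identical normalized names, 0.70 for 75%+ Jaccard overlap, else 0.0."""
    na, nb = normalise_filename(a), normalise_filename(b)
    if na and na == nb:
        return 0.85
    ta, tb = set(na.split()), set(nb.split())
    if ta and tb and len(ta & tb) / len(ta | tb) >= 0.75:
        return 0.70
    return 0.0
```

Dates are removed before separators are replaced, because the German date format uses dots that would otherwise be turned into spaces first.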

Layer 3: amendment references. Some documents explicitly name their parent. A “Nachtrag Nr. 1 zum Pachtvertrag vom 15.03.2024” is an amendment to a specific lease agreement dated March 2024. The system scans the first 2000 characters of each document for patterns like these in German and English:

“Nachtrag Nr. 1 zum Pachtvertrag vom 15.03.2024”
“Amendment No. 1 to the EPC Contract dated 15 March 2024”
“1. Änderung des Bebauungsplans Nr. 42”

When a match is found, the system links the amendment to its parent document and suggests the version relationship.
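A sketch of the scan, with one pattern per example phrase above. The patterns are written to fit those three phrases and are assumptions, not the production set. Real documents will need more variants.

```python
import re
from typing import Optional

AMENDMENT_PATTERNS = [
    # "Nachtrag Nr. 1 zum Pachtvertrag vom 15.03.2024"
    re.compile(
        r"Nachtrag\s+Nr\.\s*(?P<number>\d+)\s+zum\s+(?P<parent>.+?)"
        r"\s+vom\s+(?P<date>\d{2}\.\d{2}\.\d{4})"
    ),
    # "Amendment No. 1 to the EPC Contract dated 15 March 2024"
    re.compile(
        r"Amendment\s+No\.\s*(?P<number>\d+)\s+to\s+the\s+(?P<parent>.+?)"
        r"\s+dated\s+(?P<date>\d{1,2}\s+[A-Z][a-z]+\s+\d{4})"
    ),
    # "1. Änderung des Bebauungsplans Nr. 42" (no date in this form)
    re.compile(r"(?P<number>\d+)\.\s+Änderung\s+des\s+(?P<parent>\S+\s+Nr\.\s*\d+)"),
]


def find_amendment_reference(text: str, scan_chars: int = 2000) -> Optional[dict]:
    """Look for an amendment phrase in the first `scan_chars` characters."""
    head = text[:scan_chars]
    for pattern in AMENDMENT_PATTERNS:
        match = pattern.search(head)
        if match:
            groups = match.groupdict()
            return {
                "number": groups["number"],
                "parent": groups["parent"],
                "date": groups.get("date"),  # None when the phrase carries no date
            }
    return None
```

The returned parent name and date are what a linking step would then compare against the other documents in the project before suggesting the relationship.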

Why suggest, not automate

Version detection is surfaced in the UI as a suggestion. “These look like versions of the same document. The September version appears newer based on the date in the filename.” The user confirms or dismisses.

I considered automating this. If two files normalize to the same name and one has a later date, just supersede the older one automatically. It would simplify the workflow.

The problem is that false positives in version detection are worse than false positives in extraction. If extraction includes a false obligation, a reviewer catches it during review. If version detection incorrectly supersedes a document, every obligation from that document disappears from the pipeline. Silently. The reviewer does not even know to look for them.

Suggest-and-confirm adds one click to the workflow. It prevents a class of silent data loss that would be very hard to debug after the fact.

The generic approach that works across languages

I tested the normalization across four naming conventions:

German: B-Plan_Entwurf_2024-03-15.pdf
English: EPC_Contract_Rev02_March2024.pdf
Italian: Contratto_EPC_v2_2024-03-15.pdf
Spanish: Permiso_Ambiental_Feb2024.pdf

The approach works for all of them without any language-specific rules. The signal (the document’s semantic name: “B-Plan”, “EPC Contract”, “Permiso Ambiental”) is the part that stays stable across versions. The noise (dates, version numbers, revision markers) is the part that changes. Stripping the noise and comparing the signal is language-agnostic.

One thing I deliberately did not strip: domain-specific status words like “Entwurf” (draft) or “Final”. These carry classification meaning. A draft B-Plan and a final B-Plan might genuinely be different documents with different legal standing, not versions of the same thing. Stripping them would create false positives between unrelated documents.
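A minimal sketch of the normalization idea, assuming the filename is split on underscores and whitespace and each token is dropped if it looks like a date, a version marker, or a word fused with a year. The specific patterns and the name normalize_name are my assumptions for illustration. Status words such as "Entwurf" and "Final" are deliberately left in place.

```python
import re
from pathlib import Path

# Hypothetical noise patterns; none of them names a specific language.
_NOISE_TOKENS = [
    r"\d{4}-\d{2}-\d{2}",          # ISO date: 2024-03-15
    r"\d{1,2}\.\d{1,2}\.\d{4}",    # dotted date: 15.03.2024
    r"(?:19|20)\d{2}",             # bare year: 2024
    r"[^\W\d_]+(?:19|20)\d{2}",    # letters fused with a year: March2024, Feb2024
    r"(?:rev|version|v)\.?\d+",    # revision markers: Rev02, v2
]
_NOISE = re.compile("|".join(f"(?:{p})" for p in _NOISE_TOKENS), re.IGNORECASE)


def normalize_name(filename):
    """Strip dates and version markers; keep the stable semantic name."""
    stem = Path(filename).stem
    tokens = re.split(r"[_\s]+", stem)
    kept = [t.lower() for t in tokens if t and not _NOISE.fullmatch(t)]
    return " ".join(kept)


def look_like_versions(name_a, name_b):
    """Two files are version candidates when their normalized names match."""
    return normalize_name(name_a) == normalize_name(name_b)
```

With these patterns, a draft and a final B-Plan normalize to different names, so they are never suggested as versions of each other.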

The impact

Across the 9-document Wachow project, version detection identified 2 version pairs (the February/September B-Plans and two versions of a grid connection agreement). Superseding the older versions before extraction eliminated roughly 121 duplicate obligations.

That is about 25% of the total duplicates, removed before extraction even runs. No embedding cost. No reconciliation logic. No reviewer time spent on duplicates that should not exist.

Combined with the reconciliation layer, the pipeline went from 580 raw obligations to somewhere around 330 visible ones. Still above the 110 ground truth, but the remaining gap is mostly granularity splits and genuine cross-document references rather than pure duplicates.

A small thing that compounds

Version detection is not a sophisticated technique. Filename normalization is a few regex substitutions. Content hashing is a string comparison. Amendment detection is a handful of patterns.

But it runs at the earliest possible point in the pipeline, before any tokens are spent on extraction. Every duplicate it prevents saves extraction cost, reconciliation cost, and reviewer attention downstream. At 9 documents the savings are modest. At 300 documents per project, they compound.

Part 5 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Part 3: [A 106-year-old legal framework]. Part 4: [15 lines of rules]. Next: how the whole pipeline costs $0.63 for nine documents.

Art4 - Why I Replaced My LLM Reconciliation Layer with 15 Lines of Rules

Abhishek Gawde — Mon, 13 Apr 2026 15:57:44 GMT

The LLM reconciliation layer was replaced with four thresholds and five ordered rules. Same results, 50x faster.

After extraction and classification, the pipeline was still producing roughly 520 obligations from 9 documents. The clause_type filter caught some false positives, but the bigger problem remained: duplicates.

Same obligation extracted from overlapping batches with slightly different wording. Same obligation appearing in two documents (the permit condition that also shows up in the EPC contract’s compliance annex). Same obligation extracted from two versions of the same document.

I needed a reconciliation layer. Something that could look at 520 obligations, find the ones that were really the same thing, and merge them.

I built it twice.

Version 1: let the LLM decide

The obvious approach: embed all 520 obligation descriptions, compute pairwise similarity, cluster the similar ones, then send each cluster to GPT-4o-mini and ask it to decide. MERGE, KEEP_BOTH, or FLAG_REVIEW.

Phase 1 (the embedding and clustering) worked well. Embed with text-embedding-3-small, compute cosine similarity, run agglomerative clustering from scipy. Partition by obligation category first so you are not comparing financial obligations against environmental ones. Group anything above 0.85 similarity.
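To make the grouping step concrete, here is a dependency-free stand-in, not the scipy code used in the pipeline. It assumes the embeddings already exist as plain lists of floats, partitions by category, and links any two items whose cosine similarity exceeds 0.85 using union-find, which behaves like single-linkage clustering with a fixed cutoff. The linkage method and all function names are assumptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_by_category(items, threshold=0.85):
    """items: list of (category, embedding). Returns clusters of item indices."""
    by_category = defaultdict(list)
    for index, (category, _) in enumerate(items):
        by_category[category].append(index)

    parent = list(range(len(items)))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only compare items inside the same category partition.
    for indices in by_category.values():
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                if cosine(items[i][1], items[j][1]) > threshold:
                    parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for index in range(len(items)):
        clusters[find(index)].append(index)
    # Singletons have nothing to reconcile, so only multi-item clusters return.
    return [members for members in clusters.values() if len(members) > 1]
```

Partitioning first also keeps the pairwise comparison count down, since the quadratic cost applies per category rather than across all obligations.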

Phase 2 (the LLM review) is where it fell apart.

It was slow. Each cluster required a separate LLM call. With dozens of clusters, the whole reconciliation step took 5 to 7 minutes. That does not sound terrible in isolation, but this is a step you re-run every time you process a new document or adjust extraction parameters. Five minutes of waiting after every change kills the iteration loop.

It was non-deterministic. The same cluster, same obligations, same prompt, would sometimes produce MERGE and sometimes KEEP_BOTH across different runs. The model was making a judgment call on borderline cases, and its judgment was not stable. A reviewer who approved reconciliation results on Monday could not trust them to be the same on Tuesday.

It cost money for no reason. About $0.15 per reconciliation run. Not expensive in absolute terms, but the cost scales with the number of obligations and the number of re-runs. And the LLM was not doing anything that required language understanding. It was looking at two similar texts and deciding whether they were similar enough. That is a numerical comparison, not a language task.

What the LLM was actually doing

I looked at the LLM’s decisions across several runs and noticed a pattern. In the vast majority of cases, its decision correlated almost perfectly with the cosine similarity score it was given as context.

Above 0.95 similarity: almost always MERGE. Between 0.88 and 0.95: mixed, unstable decisions. This was the non-deterministic zone. Below 0.88: almost always KEEP_BOTH.

The LLM was essentially converting a continuous similarity score into a discrete decision, and doing it inconsistently in the middle range. I was paying for a slow, expensive, unreliable threshold function.

Version 2: threshold rules

The rewrite was short:

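from dataclasses import dataclass

@dataclass
class ThresholdConfig:
    # Cosine-similarity cutoffs used by the reconciliation rules.
    auto_merge: float = 0.95             # any pair above this merges
    flag_review: float = 0.88            # lower bound of the human-review zone
    same_doc_clause_merge: float = 0.90  # same document and clause reference
    same_category_merge: float = 0.95    # same obligation category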

Five rules, checked in order:

Same document, same clause reference, similarity above 0.90: MERGE. This catches the batch overlap problem, where the same obligation gets extracted from two overlapping 6-page windows with slightly different phrasing.
Same obligation category, similarity above 0.95: MERGE. Cross-document duplicates where the same condition appears in a permit and in a contract annex.
Any pair above 0.95 similarity regardless of metadata: MERGE.
Similarity between 0.88 and 0.95: FLAG_REVIEW. The borderline zone where the LLM was unstable. Now it goes to a human instead of a coin flip.
Below 0.88: KEEP_BOTH. Different obligations that happen to sound similar.

When a cluster merges, the canonical item is selected deterministically: highest confidence score, then longest description, then first by insertion order. No randomness. No judgment calls.
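The ordered rules and the canonical-item tie-break can be sketched as follows. This is an illustration under assumptions: the Obligation fields, the Decision enum, and the function names are hypothetical, and I treat "above" as a strict comparison, so a pair at exactly 0.95 falls into the review zone.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    MERGE = "MERGE"
    FLAG_REVIEW = "FLAG_REVIEW"
    KEEP_BOTH = "KEEP_BOTH"


@dataclass(frozen=True)
class ThresholdConfig:
    auto_merge: float = 0.95
    flag_review: float = 0.88
    same_doc_clause_merge: float = 0.90
    same_category_merge: float = 0.95


@dataclass(frozen=True)
class Obligation:  # hypothetical record shape
    doc_id: str
    clause_ref: str
    category: str
    description: str
    confidence: float


def decide(a, b, similarity, cfg=ThresholdConfig()):
    """Apply the five rules in order; the first match wins."""
    same_clause = a.doc_id == b.doc_id and a.clause_ref == b.clause_ref
    if same_clause and similarity > cfg.same_doc_clause_merge:
        return Decision.MERGE
    if a.category == b.category and similarity > cfg.same_category_merge:
        return Decision.MERGE
    if similarity > cfg.auto_merge:
        return Decision.MERGE
    if similarity >= cfg.flag_review:
        return Decision.FLAG_REVIEW
    return Decision.KEEP_BOTH


def pick_canonical(cluster):
    """Highest confidence, then longest description, then earliest inserted."""
    best = max(
        range(len(cluster)),
        key=lambda i: (cluster[i].confidence, len(cluster[i].description), -i),
    )
    return cluster[best]
```

Because every branch is a plain comparison, the same inputs always produce the same decision.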

The results

520 obligations in. 448 visible out. 72 merged as duplicates.

Time: 7 seconds. Down from 5 to 7 minutes.

LLM cost: $0. Down from $0.15.

Embedding cost: $0.003 (embedding 520 descriptions with text-embedding-3-small).

Deterministic: yes. Run it ten times, get the same result ten times.

The 7 seconds is almost entirely the embedding step. The threshold comparisons themselves are instant.

Why the middle zone matters

The temptation was to set a single threshold and automate everything. Above 0.90, merge. Below 0.90, keep. No human review.

I tried this. The problem is that the 0.88 to 0.95 range contains genuinely ambiguous cases. Two obligations that describe similar duties but with different responsible parties. Two conditions that sound alike but apply to different project phases. A human can distinguish these in seconds by reading the clause references and context. An automated threshold cannot.

FLAG_REVIEW is not a compromise. It is the correct answer for cases where the similarity score alone does not carry enough information. The system is honest about what it knows and what it does not.

In practice, about 15% of clusters land in the FLAG_REVIEW zone. The rest resolve automatically. A reviewer spends a few minutes on the flagged items rather than hours reviewing everything.

The general principle

Use LLM calls for tasks that require genuine language understanding: extraction, classification, summarization, reasoning about ambiguous text. Use algorithmic approaches for tasks that are fundamentally about numerical comparison: similarity thresholds, clustering, deduplication, sorting.

The reconciliation task looked like it needed language understanding. Two obligation descriptions, written in slightly different ways, do they mean the same thing? That feels like a question only a language model can answer.

But the embedding step already converted language into numbers. After embedding, the question is: is 0.93 similar enough to merge? That is not a language question. It is a threshold question. And threshold questions have deterministic, instant, free answers.

It is easy to end up with an LLM call sitting in a hot loop, doing something that could be a lookup table, a regex, or a comparison operator. Not because the LLM cannot do it, but because it is the wrong tool for that specific subtask.

The SemDeDup paper (Abbas et al., 2023), whose method NVIDIA's NeMo Curator implements for semantic deduplication of training data, uses the same approach: embed, cluster, threshold. No LLM in the dedup loop. The Splink record linkage library does the same for entity resolution. Embed or featurize, compare, threshold, human review for borderline cases.

The pattern is consistent across domains: language models for understanding, algorithms for deciding.

Part 4 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Part 3: [A 106-year-old legal framework]. Next: preventing duplicates at the source with document version detection.

Art3 - A 106-Year-Old Legal Framework Improved My LLM’s Extraction Accuracy

Abhishek Gawde — Sun, 12 Apr 2026 15:57:36 GMT

Extract everything, classify by type, let the confidence scorer decide what reaches the review queue.

The precision problem from the last article boiled down to this: the model could not reliably distinguish an obligation from things that look like obligations but are not.

“The Contractor shall complete commissioning by 30 June” is an obligation. “The Employer may extend the deadline by 30 days” is not. “The Contractor warrants that all materials meet ISO standards” is not. “Hinweis: The local fire department should be notified” is not.

To a human with legal training, these are obviously different. To an LLM told to “extract all obligations,” they all contain action verbs, parties, and specificity. They all pattern-match against what an obligation looks like on the surface.

I needed a way to teach the model the underlying distinctions. Not through more negative examples or longer exclusion lists, but through a conceptual framework it could apply consistently.

I found one in a paper from 1917.

Hohfeld’s framework

Wesley Newcomb Hohfeld was a Yale law professor who published a paper called “Fundamental Legal Conceptions as Applied in Judicial Reasoning.” His contribution was deceptively simple: he broke all legal relationships into eight atomic concepts, arranged in pairs.

The four that matter for extraction:

Duty (obligation): a party must do something. Breach carries legal consequences. “The Contractor shall install the substation by 30 June.”

Right: a party is entitled to something. The counterpart of a duty. “The Employer is entitled to liquidated damages if commissioning is delayed.”

Privilege (or liberty): a party may do something. No one can compel them not to. “The Employer may extend the deadline at its discretion.”

No-right: a party cannot demand something from another party. The counterpart of a privilege.

The key insight is that these are not fuzzy categories. They are precise, mutually exclusive classifications. A clause is a duty or a right or a privilege. It cannot be two at once. And critically, only duties are obligations. Rights, privileges, and their counterparts look similar on the surface but impose no binding requirement on anyone.

Turning the framework into a prompt

I did not try to teach the model Hohfeld’s full taxonomy. Instead, I translated the relevant distinctions into a classification scheme the model could apply during extraction:

OBLIGATION          - binding duties ("shall", "must")
RIGHT               - entitlements and permissions ("is entitled to", "may")
CONDITION_PRECEDENT - conditions gating effectiveness
REPRESENTATION      - statements of fact ("warrants that", "represents that")
INFORMATIONAL       - advisory notes ("Hinweis", "for information")
BOILERPLATE         - procedural provisions
DEFINITION          - interpretive clauses

The extraction prompt includes a decision table. Rather than telling the model to extract only obligations and skip everything else, I told it to extract everything it found in binding sections and classify each item:

Even if you are unsure, still extract the item and set clause_type to the most appropriate value.

This is a deliberate design choice. The old approach asked the model to make a binary decision during extraction: is this an obligation or not? The model was bad at this because the boundary is genuinely fuzzy in some cases. The new approach asks the model to make a classification decision: what kind of clause is this? That is an easier question, and the downstream system handles the filtering.

Why “extract everything, classify later” works better

The binary approach (“extract only obligations”) forces the model to be both extractor and filter simultaneously. When it encounters something ambiguous, it has two options: include it (risking a false positive) or skip it (risking a false negative). Given a prompt that emphasises thoroughness, it includes.

The classification approach separates the two tasks. The model extracts everything it finds and labels each item. A non-obligation item is not deleted. It is classified as RIGHT or INFORMATIONAL or DEFINITION, and the confidence scorer downstream uses that classification as one of several signals.

The confidence scorer gives the clause_type signal a weight of 0.20 in the overall score. An item classified as OBLIGATION gets the full 0.20. An item classified as BOILERPLATE gets 0.10 (real provision, just procedural). Anything else gets 0.0.

Losing 0.20 from the confidence score is often enough to push a non-obligation below the 0.70 quarantine threshold. The item is not thrown away. It sits in quarantine where a human reviewer can look at it. This matters because the model is sometimes wrong about its own classification. Silent deletion loses data. Quarantine preserves it.
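A small sketch of how the clause_type signal could interact with the quarantine threshold. The 0.20 weight, the 0.10 value for BOILERPLATE, and the 0.70 threshold come from the text above. The other_signals argument is a hypothetical stand-in for the sum of the remaining signals, which are not described here, and the function names are assumptions.

```python
CLAUSE_TYPE_WEIGHT = 0.20
QUARANTINE_THRESHOLD = 0.70


def clause_type_signal(clause_type):
    """Contribution of the clause_type label to the overall confidence."""
    if clause_type == "OBLIGATION":
        return CLAUSE_TYPE_WEIGHT        # full weight
    if clause_type == "BOILERPLATE":
        return 0.10                      # real provision, just procedural
    return 0.0                           # RIGHT, DEFINITION, INFORMATIONAL, ...


def route(other_signals, clause_type):
    """other_signals: hypothetical sum of all non-clause_type signals (0 to 0.80)."""
    score = other_signals + clause_type_signal(clause_type)
    # Items below the threshold are kept in quarantine, never deleted.
    return "review_queue" if score >= QUARANTINE_THRESHOLD else "quarantine"
```

For example, an item whose other signals sum to 0.60 reaches the review queue when labelled OBLIGATION and lands in quarantine when labelled RIGHT.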

Few-shot examples from scholarship, not from your own data

I added three positive and two negative examples to the prompt. The choice of where to source examples was deliberate.

The tempting approach is to pull examples from your own extracted data. You have hundreds of obligations already. Pick five good ones and five bad ones.

The problem: if your extraction pipeline has systematic biases (and it does, every pipeline does), training on your own output creates a feedback loop. The model learns your biases, not the ground truth.

Instead, the examples came from legal scholarship patterns based on the Hohfeldian framework. Generic obligations from different legal domains: a construction deadline with liquidated damages, a noise restriction with permit revocation risk, a regulatory submission requirement. The negative examples show things that look like obligations but are not: an employer’s discretionary right to extend a deadline, a definitional clause.

These examples teach the concept without anchoring the model to my specific documents or my specific errors.

Granularity guidance

One more precision problem the framework helped with: granularity splits.

“The Contractor shall install the transformer, connect it to the grid, and commission it by 30 June” is one obligation. But the model sometimes split it into three: install, connect, commission. Each sub-item looks like an obligation in isolation. The model is not wrong exactly, just too granular.

The prompt now includes explicit guidance:

Do NOT split a single clause into multiple obligations unless it genuinely imposes distinct duties on different parties or with different deadlines. “The Contractor shall do X, Y, and Z by Date D” is ONE obligation, not three. “The Contractor shall do X by Date D; the Employer shall do Y by Date E” is TWO.

The rule is simple: same party, same deadline, same clause means one obligation regardless of how many sub-actions it contains. Different parties or different deadlines means multiple obligations even if they share a clause.

What changed in practice

The Hohfeldian classification was more visible on contracts than on permits. German building permits are mostly binding conditions. There are not many rights or definitions to misclassify. But EPC contracts are full of representations, warranties, rights, and definitions that the old pipeline was extracting as obligations.

The granularity guidance had a more uniform effect. Across all document types, the number of granularity splits dropped noticeably.

Combined with the confidence scoring, the clause_type classification gave the pipeline a way to be permissive during extraction and selective during review. The model extracts broadly. The scorer filters. The human reviewer sees only the items that passed both gates.

The transferable lesson

When your model cannot distinguish categories, look for existing taxonomies in the domain literature before inventing your own.

Hohfeld already solved the legal clause classification problem in 1917. The categories are precise, well-defined, and map directly onto the distinctions an extraction pipeline needs to make. I did not need to invent a taxonomy through trial and error. I needed to find the one that already existed and translate it into a prompt.

This applies beyond legal documents. Medical records have established classification frameworks for symptoms, diagnoses, and treatments. Financial documents have standardised categories for assets, liabilities, and contingencies. Engineering specifications have formal taxonomies for requirements, constraints, and guidelines.

The domain experts already built the taxonomy. The LLM just needs to be told about it.

Part 3 of a series on building an LLM extraction pipeline. Part 1: [9 out of 61]. Part 2: [580 from 110]. Next: why I replaced my LLM reconciliation layer with 15 lines of threshold rules.

Art2 - When Your LLM Finds 580 Things and Only 110 Are Real

Abhishek Gawde — Sat, 11 Apr 2026 15:56:36 GMT

Gap analysis of 580 obligations extracted from 9 documents. Only 110 were real.

In the last post, I described how a conceptual prompt and batch extraction took obligation recall from 14.8% to above 90%. The model was finding nearly everything.

Then I ran it across a full project.

Nine documents from a single solar development: building permits, an EPC contract, land leases, grid connection agreements. The pipeline extracted 580 obligations. A human reviewing those same documents would expect roughly 110.

Five times too many. And unlike the recall problem, which had one root cause (the output ceiling), the precision problem had five causes stacked on top of each other.

The gap analysis

I went through the 580 one by one. Here is roughly where they came from:

~110 were real. Genuine, distinct obligations. The signal.

~121 were document version duplicates. The project had a February B-Plan and a September B-Plan. Same document, different versions. The pipeline processed both and extracted the same obligations twice. This was not an extraction error. It was a pipeline design error. We never told the system these were versions of the same document.

~100 were cross-document duplicates. The same obligation referenced in multiple documents. A building permit condition that also appears in the EPC contract’s compliance annex. Both extractions are correct. Both are the same underlying obligation.

~100 were granularity splits. A single obligation split into two or three sub-obligations. “The Contractor shall install the transformer, connect it to the grid, and commission it by 30 June” is one obligation. The model sometimes reported it as three.

~50 were over-extractions. Rights, definitions, and advisory notes classified as obligations. “The Employer may extend the deadline by 30 days” is a right, not an obligation. “Hinweis: The local fire department should be notified” is advisory, not binding. The model extracted both as obligations.

~100 were near-duplicates. The same obligation, slightly different wording across extraction batches. Because each batch runs independently, the model sometimes phrases the same obligation differently when it appears near a batch boundary. The deduplication catches exact matches but misses paraphrases.
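The bucket counts above are approximate, but tallying them is a useful sanity check on how much each cause contributes. A short sketch, with the dictionary keys chosen by me for illustration:

```python
# Approximate counts from the gap analysis above.
buckets = {
    "real": 110,
    "version_duplicates": 121,
    "cross_document_duplicates": 100,
    "granularity_splits": 100,
    "over_extractions": 50,
    "near_duplicates": 100,
}

total = sum(buckets.values())       # should land close to the 580 extracted
noise = total - buckets["real"]     # everything that is not a distinct obligation


def share_of_noise(cause):
    """Fraction of the non-real items attributable to one cause."""
    return buckets[cause] / noise


for cause in buckets:
    if cause != "real":
        print(f"{cause}: {share_of_noise(cause):.0%}")
```

The rounded buckets sum to slightly more than 580, which is expected given that each count is an estimate.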

Why over-extraction is rational

Here is the thing that took me a while to understand: the model is not making a mistake. It is making a rational decision given its instructions.

When the prompt says “be thorough, extract ALL obligations, do not miss any,” the model hears a clear signal about which error is worse. Missing a real obligation (false negative) is worse than including a fake one (false positive). So when the model encounters something ambiguous, like an advisory note that uses the word “should” or a right that looks structurally similar to a duty, it includes it.

This is actually correct behaviour for a human-in-the-loop system. It is better to review a false positive than to miss a real obligation that carries legal consequences. But it means precision has to be solved separately, with mechanisms that the extraction prompt alone cannot provide.

The prompt got us recall. Everything else had to get us precision.

Three responses to one problem

The five causes mapped to three categories of fix:

Fix the extraction itself to reduce over-extraction and granularity splits at the source. This meant teaching the model what an obligation is not, adding clause type classification, and giving explicit granularity guidance. That is the next article in this series.

Add a reconciliation layer to detect and merge near-duplicates and cross-document duplicates after extraction. Embed all obligations, cluster by similarity, decide which clusters should merge. This went through two complete implementations: an LLM-based approach that worked but was slow and non-deterministic, and a threshold-based rewrite that runs in 7 seconds. That story is two articles from now.

Detect document versions before extraction to prevent version duplicates from ever being created. If the system knows that the February B-Plan and September B-Plan are versions of the same document, it can let the user supersede the older one before extraction runs. The cheapest duplicate to remove is the one you never create. That is also a later article.

The uncomfortable middle

Here is where the project sat after the gap analysis:

92% recall (56 out of 61 ground truth obligations found)
74% precision (56 out of 76 extracted were correct, on the best single document)
Across 9 documents: 580 raw obligations, roughly 110 real
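Both ratios can be recomputed directly from the raw counts, which is a cheap check whenever the numbers are quoted. A minimal sketch with hypothetical helper names:

```python
def recall(true_positives, ground_truth_total):
    """Share of ground truth obligations that the pipeline found."""
    return true_positives / ground_truth_total


def precision(true_positives, extracted_total):
    """Share of extracted obligations that were correct."""
    return true_positives / extracted_total


# Counts from the best single document.
print(f"recall:    {recall(56, 61):.1%}")
print(f"precision: {precision(56, 76):.1%}")
```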

The recall number was good. The precision number was not terrible for a first pass, but 580 obligations in a review queue when 110 are real is not a usable product. A human reviewer would lose trust after the first fifty false positives.

The temptation at this point was to tighten the extraction prompt. Add more negative examples. Be more explicit about what not to extract. But prompt tightening is a precision-recall tradeoff. Every rule you add to reduce false positives risks missing a real obligation that does not fit the rule. I had already learned this lesson with keyword lists.

The better approach was to keep extraction permissive, accept that the model will over-extract, and build downstream systems that filter, merge, and surface only the high-confidence results. Let the model be thorough. Let the pipeline be precise.

That split, between a generous extractor and a strict post-processor, shaped everything that came after.

What this means beyond legal documents

If you are building any LLM extraction pipeline, you will probably hit this same pattern. Early work focuses on recall: can the model find the things? That problem is usually solvable with better prompts, more context, and batch processing.

Then you hit the precision wall. The model finds too many things. Some are duplicates. Some are misclassified. Some are the same thing expressed differently. And unlike recall, precision does not have a single fix. It requires multiple mechanisms working together: better classification at extraction time, post-extraction deduplication, confidence scoring, and human review for the borderline cases.

The instinct is to solve precision in the prompt. Sometimes you can, partially. But for any non-trivial extraction task, precision is a systems problem, not a prompting problem.

The next article is about the first piece: teaching the model what an obligation is not, using a legal framework from 1917.

This is part 2 of a series on building an LLM extraction pipeline. Part 1: [I Asked GPT-4 to Find 61 Obligations. It Found 9.] Part 3 will cover clause type classification and the Hohfeldian framework.

Art1 - I Asked GPT-4 to Find 61 Obligations in a Legal Document. It Found 9.

Abhishek Gawde — Fri, 10 Apr 2026 15:58:38 GMT

Three fixes took extraction from 9 to 56 out of 61 ground truth obligations.

A solar project generates 300 to 500 documents over its lifetime. EPC contracts, building permits, land leases, grid connection agreements, environmental assessments. Buried inside those documents are the binding obligations: things that must actually be done, by specific parties, by specific dates, or the project faces legal consequences.

Today, someone reads the documents, copies obligations into a spreadsheet, and hopes they caught everything.

I wanted to automate that. So I pointed GPT-4o-mini at a 17-page German building permit and told it to extract all the obligations.

It found 9. A human reviewer found 61.

That is 14.8% recall. Not a rounding error. A near-total miss.

What went wrong

Three things, and they are worth understanding because they apply to any LLM extraction task, not just legal documents.

The output ceiling. GPT-4o-mini has a 16K token output limit. A 17-page document with dozens of obligations produces structured output that easily overflows that limit. Nothing in the generated text warns you. Generation simply stops, and the only signal is a finish_reason of “length” in the API response, which is easy to overlook. You get the first handful of obligations and nothing else. Unless you check that field or count the results, you do not know you are missing anything.

English keywords for a German document. The system prompt listed English trigger words: “shall”, “must”, “is required to”. The document was in German. It uses constructions like “ist zu”, “hat zu”, “Die Nebenbestimmungen sind zu beachten.” The model was pattern-matching on words that were not there.

No structural awareness. German building permits have a specific anatomy. They contain Nebenbestimmungen (binding conditions), Auflagen (regulatory conditions), and Hinweise (advisory notes). The prompt did not distinguish between these, so the model had no framework for deciding what counted as an obligation and what was just informational guidance.

Each of these problems is a version of the same underlying mistake: treating extraction as a keyword search rather than a reasoning task.
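The output-ceiling failure in particular can be caught mechanically. The sketch below assumes the structured output is requested as JSON, which the article does not state, and that the caller passes in the finish reason reported by the API (OpenAI's Chat Completions API reports "length" when the token limit cut the response short). The function name is hypothetical.

```python
import json


def looks_truncated(finish_reason, raw_output):
    """Return True when a structured response was probably cut off."""
    if finish_reason == "length":
        return True  # the API reported that the output limit was hit
    try:
        json.loads(raw_output)
    except json.JSONDecodeError:
        return True  # unparseable output usually means it ended mid-item
    return False
```

A check like this turns a silent miss into an explicit failure that can trigger a retry with a smaller input.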

The fix: teach concepts, not keywords

The keyword approach was fundamentally broken. No list of trigger words can cover every language, every legal tradition, and every drafting style. A German Baugenehmigung uses different constructions than an English EPC contract, which uses different constructions than a Spanish environmental permit. Chasing keywords is a losing game.

Instead, I described the concept of an obligation:

An obligation is a binding duty imposed on a party by contract, statute, or regulation that requires that party to perform (or refrain from performing) a specific act, where breach carries a legal remedy.
Key elements:
A duty-bearer is identified (a specific party must act)
An action or forbearance is specified (what must be done or not done)
Binding force exists (breach carries legal consequences)

Then I added structural guidance for the specific document tradition:

In permits and regulatory approvals (Baugenehmigung, environmental permits):
Binding conditions (Nebenbestimmungen, Auflagen): extract each one
Hinweise (advisory notes): classify as INFORMATIONAL, not OBLIGATION

The insight is simple but easy to miss. GPT-4o-mini already understands legal obligation patterns across languages. It knows what “ist zu” means in the context of a German regulatory document. You do not need to teach the model German legal terminology. You need to teach it the concept and trust that it can recognise that concept in whatever language it encounters.

This is the difference between a keyword list and a conceptual prompt. A keyword list says “look for these specific strings.” A conceptual prompt says “here is what an obligation is, now find all instances of this concept in the document, regardless of how it is expressed.”

The result, and the new problem

The conceptual prompt, iterated over a few rounds of refinement, took extraction from 9 to 51 obligations. A dramatic improvement, but still short of the 61 ground truth.

The remaining gap was not a prompting problem. It was a mechanical one. Even with the improved prompt, the model was still hitting the output ceiling. A 17-page document with 61 obligations produces more structured output than fits in the response window. The model was finding the obligations but running out of space to report them.

The fix for that was batch extraction: splitting the document into overlapping chunks and processing each chunk independently. That took us from 51 to above 90% recall.
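A minimal sketch of the windowing step, assuming page-based windows. The 6-page window size matches the figure mentioned elsewhere in this series. The one-page overlap and the function name are assumptions for illustration.

```python
def page_windows(num_pages, window=6, overlap=1):
    """Return (start, end) page ranges, end exclusive, covering every page."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("window must be positive and overlap smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while start < num_pages:
        end = min(start + window, num_pages)
        windows.append((start, end))
        if end == num_pages:
            break  # the last window reached the end of the document
        start += step
    return windows


# A 17-page permit split into overlapping 6-page windows.
for start, end in page_windows(17):
    print(f"pages {start + 1} to {end}")
```

The overlap is what stops an obligation that straddles a page boundary from being cut in half, and it is also why the same obligation can be extracted twice with slightly different wording.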

But something unexpected happened on the way to high recall. Once the model was finding nearly everything, precision collapsed. From a single project with 9 documents, the pipeline extracted 580 obligations. A human would expect about 110.

It turned out that getting an LLM to find obligations was the easy part. Getting it to stop finding things that are not obligations was the hard part.

That is the next article.

This is the first in a series about building an LLM-powered extraction pipeline for infrastructure project documents. Each post covers one engineering problem and one transferable lesson. The domain is solar energy, but the patterns apply to any structured extraction task.

The Queue Between the Machine and the Graph

Abhishek Gawde — Tue, 31 Mar 2026 17:30:55 GMT

The human review queue is not a failure state. It is a designed component. The shape of the interface, what a reviewer sees and what choices they have, directly affects both the quality of the knowledge base and how quickly it accumulates trustworthy data.

Three Reasons a Fact Enters the Queue

Not everything that gets extracted needs human review. The pipeline makes a distinction based on confidence.

Facts above 0.90 go directly into the knowledge graph. A party name from a signature block. A milestone date stated explicitly as a calendar date. The pipeline is reliable enough at this level that reviewing them all would be expensive without meaningfully improving quality.

Facts between 0.70 and 0.90 go into the graph but are marked as unverified. They are live and queryable. A deadline derived from “sixty days after notice to proceed” where the notice date was not in the retrieved chunks. A liquidated damages rate where the clause structure required some inference. These facts are useful but not fully trusted. The queue is optional for these: a reviewer can confirm or correct them, and the system surfaces them with a caveat in answers regardless.

Facts below 0.70 do not go into the graph at all. A party reference that uses “the relevant contractor” without identifying who that is. A deadline expressed relative to an event the extraction could not locate. Until a human reviews these, they do not exist in the knowledge base.

There is one more entry type: conflict escalations. When two documents contain contradictory facts and the conflict resolution rules cannot determine which takes precedence, the conflict goes to a human. An amendment that changes a rate but whose effective date is ambiguous. Two permits that both appear to apply to the same condition with different expiry dates. These require judgment, not just confirmation.
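The routing described in this section can be sketched as a single function. The 0.90 and 0.70 thresholds come from the text. The enum names, the boundary handling at exactly 0.90 and 0.70, and the choice to let an unresolved conflict take precedence over the confidence score are my assumptions.

```python
from enum import Enum


class Route(Enum):
    GRAPH = "written to the graph"
    GRAPH_UNVERIFIED = "written to the graph, marked unverified, review optional"
    HELD_FOR_REVIEW = "held out of the graph until a human reviews it"
    CONFLICT_ESCALATION = "sent to a human to resolve a contradiction"


def route_fact(confidence, unresolved_conflict=False):
    """Decide where an extracted fact goes."""
    if unresolved_conflict:
        return Route.CONFLICT_ESCALATION  # assumed to override confidence
    if confidence > 0.90:
        return Route.GRAPH
    if confidence >= 0.70:
        return Route.GRAPH_UNVERIFIED
    return Route.HELD_FOR_REVIEW
```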

What a Reviewer Needs to See

Too little context and the reviewer cannot make a good decision. Too much and review becomes slow and exhausting.

The minimum useful display for a queue item is four things: the extracted value with its field name; the source clause verbatim with clause number and page; one or two surrounding clauses for context, because some clauses only make sense in relation to adjacent ones; and any competing fact already in the graph from a different document, shown side by side.

The confidence score should also be visible as a plain signal, not a number buried in metadata.

Three entry types, three actions, one feedback loop. Facts below 0.70 are held out of the graph entirely until a human confirms them. The edit-and-approve action is the most valuable: it records both the original extraction and the correction as a training signal for systematic improvement.

The Three Actions and What They Trigger

A reviewer has three choices: approve, reject, or edit and approve.

Approving writes the fact to the graph with a human-verified flag and confidence set to 1.0. It is also automatically added to the ground truth dataset, so the next evaluation run has one more confirmed example to work with.

Rejecting records the fact as rejected with the source noted. It does not enter the graph.

Editing and approving is the most valuable action. The reviewer corrects what the extraction got wrong and approves the corrected version. Both the original extraction and the correction are recorded together. This pair — what the model produced versus what was correct — is a direct signal about which field types or document structures are systematically causing errors. Over time this is more actionable than a precision metric alone.
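
A minimal sketch of what the three actions might record. The ReviewOutcome fields and the apply_review name are hypothetical. The text only states the confidence and ground-truth behaviour for plain approval, so treating an edited fact the same way is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReviewOutcome:
    write_to_graph: bool
    value: Optional[str]            # value written to the graph, if any
    confidence: Optional[float]     # 1.0 once a human has verified the fact
    human_verified: bool
    add_to_ground_truth: bool
    correction_pair: Optional[Tuple[str, str]]  # (model output, corrected value)


def apply_review(action: str, extracted: str, corrected: Optional[str] = None) -> ReviewOutcome:
    """Translate a reviewer's choice into what gets stored."""
    if action == "approve":
        return ReviewOutcome(True, extracted, 1.0, True, True, None)
    if action == "reject":
        # The rejection itself is recorded elsewhere; nothing enters the graph.
        return ReviewOutcome(False, None, None, False, False, None)
    if action == "edit_and_approve":
        if corrected is None:
            raise ValueError("edit_and_approve requires a corrected value")
        # Assumption: a corrected fact is stored like an approved one,
        # plus the (original, correction) pair kept as an error signal.
        return ReviewOutcome(True, corrected, 1.0, True, True, (extracted, corrected))
    raise ValueError(f"unknown action: {action}")


outcome = apply_review("edit_and_approve", "60 days after NTP", "90 days after NTP")
```

The correction pair is the part worth aggregating: grouping pairs by field type shows where the extraction is systematically wrong.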

Prioritisation and the Feedback Loop

A queue that fills up and stays full is not useful. Two things help manage it.

Prioritisation: an obligation with a deadline three weeks away and significant liquidated damages exposure is more urgent than a warranty period on equipment already commissioned. The queue interface can surface financial significance based on what the graph already knows about the fact and its associated contract.

Escalation: items older than a set number of days without action trigger an alert. Items blocking specific milestones get flagged. The queue should actively surface what needs attention, not act as passive storage.

The deeper value of the queue is as a source of labelled training data. Every approved item is a fact that was uncertain enough to need review and has now been confirmed by a human expert who read the source clause. The ground truth built this way is weighted toward the hard cases, the ones where the pipeline struggles, which is exactly where evaluation and improvement need to be targeted. A system being actively reviewed gets better at its difficult extractions over time without anyone consciously directing that improvement.

Where I Am Taking This

The next article looks at what never-delete means as a design principle: why documents are versioned rather than overwritten, why facts are superseded rather than updated, and what makes this the right choice for systems where legal history matters.

I write about building enterprise document intelligence systems: the architecture, the design decisions, and the things that do not work the way you would expect. If this was useful, follow along for the next piece.

Why Search Works Better When You Run Two Different Approaches at Once

Abhishek Gawde — Mon, 30 Mar 2026 17:31:17 GMT

When you search for something in a document system, something has to decide which chunks of text are most relevant to your question. The way that decision gets made matters more than most people realise, and the tradeoffs between different approaches are not obvious until you see them fail in practice.

There are two main approaches in wide use today. Keyword search has been around for decades. Semantic search using vector embeddings is newer. Both have genuine strengths and genuine blind spots. The interesting design question is not which one to use, but how to combine them so each one covers the other’s weaknesses.

BM25 finds the exact terms. Vector search finds the meaning. Neither alone finds everything. Reciprocal Rank Fusion combines the two ranked lists using position rather than raw score, then a cross-encoder re-ranks the top candidates for a final quality pass.

How Keyword Search Works and Where It Fails

Keyword search operates on exact terms. It looks for documents that contain the words you typed, weighted by how often those words appear and how rare they are across the whole collection. A term that appears frequently everywhere is less useful as a signal than a term that appears rarely but appears in the documents you are searching.

This approach is very good at finding things when you know the exact terminology. If you search for “liquidated damages cap” in a collection of contracts, keyword search will find every clause that uses those exact words. It is fast, it is precise, and it does not require any machine learning or training data.

The failure mode is vocabulary mismatch. If one contract uses “liquidated damages” and another uses “delay penalties” to mean the same thing, a keyword search for one will not find the other. The meaning is identical. The words are different. Keyword search does not know that.

In a multilingual document collection, this problem compounds. A question asked in English will not find relevant clauses written in German, even if the German clause says exactly the same thing, because the words do not overlap.

How Semantic Search Works and Where It Fails

Semantic search converts text into numerical representations called embeddings. Each piece of text, whether a query or a document chunk, becomes a point in a very high-dimensional space. The key property is that text with similar meaning ends up close together in that space, regardless of the specific words used.

When you search semantically, the system finds document chunks whose embeddings are closest to the embedding of your query. This is why it handles vocabulary mismatch well. “Liquidated damages” and “delay penalties” will embed close to each other if the surrounding context makes their meaning clear. A question in English will embed near its German equivalent, provided the embedding model is multilingual, because the model has learned that they mean the same thing.

The failure mode is specificity. Semantic search is good at finding things that are thematically related to your query. It is less reliable when you need something very specific. If you search for “clause 14.3” expecting to find a particular clause with that reference, semantic search may return clauses that are thematically similar but do not contain that reference at all. The model has learned what clauses mean, but it treats clause numbers as words rather than as identifiers that should match exactly.

Technical terms, proper nouns, reference numbers, and specific identifiers are all cases where semantic search can mislead you.

Running Both Simultaneously

The insight behind hybrid retrieval is that these two failure modes are mostly non-overlapping. Keyword search fails on vocabulary mismatch. Semantic search fails on exact identifiers. Running both and combining the results means each approach covers the cases the other misses.

In practice, both searches run against the same document collection at the same time. Each produces a ranked list of results. Then those two ranked lists are merged into a single list.

The merging step is not as simple as averaging the scores. The scores from keyword search and semantic search are not on the same scale and cannot be compared directly. A score of 0.8 from keyword search means something completely different from a score of 0.8 from semantic search.

The merging approach that works well is called Reciprocal Rank Fusion. Instead of using raw scores, it uses the positions in each ranked list. A result that appears near the top of both lists gets a high combined score. A result that only appears in one list gets a lower combined score. The formula for each contribution is simple: one divided by the sum of a small constant and the rank position, or 1 / (k + rank). The combined score is the sum of these contributions across both lists.

The reason this works better than score-based merging is that rank positions are comparable even when raw scores are not. The combined score reflects how well a result does across both approaches, which is a better signal than how well it does on just one.
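
Reciprocal Rank Fusion is short enough to sketch in full. The function name and the clause identifiers are illustrative. The constant k=60 is a commonly used default and is an assumption here, since the description above only calls for a small constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids using rank positions only.

    Every list contributes 1 / (k + rank) for each id it contains,
    with rank starting at 1. Raw retrieval scores are never used.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_ranking = ["clause_14_3", "clause_8_2", "clause_2_1"]
vector_ranking = ["clause_8_2", "clause_9_4", "clause_14_3"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

In this toy input, clause_8_2 is second in one list and first in the other, so it ends up ahead of clause_14_3, which is first in one list but third in the other. Results that appear in only one list sink to the bottom.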

The Final Quality Gate

After the two ranked lists are merged, there is one more step. A re-ranking model takes the top results and re-scores them using a more careful but more expensive method.

The two-stage approach, fast retrieval followed by careful re-ranking, is a common pattern in information retrieval. The first stage retrieves a broad candidate set quickly. The second stage evaluates those candidates more carefully against the specific query. Doing the expensive evaluation on a small set rather than the whole collection keeps the system fast enough to use in practice.

The re-ranking model used here looks at the query and a candidate result together as a pair, and scores how well they match. This is more accurate than comparing independent embeddings because it considers the specific relationship between this query and this result rather than each one in isolation.

The tradeoff is speed. Running this on thousands of candidates is not feasible. Running it on the top twenty or so results from the merged list is.
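
The two-stage shape can be sketched without committing to a particular model. Here score_pair stands in for the pairwise re-ranking model, and toy_overlap_scorer is a deliberately crude placeholder so the example runs on its own. Both names are hypothetical, and a real system would plug in a trained model at that point.

```python
from typing import Callable, List, Tuple


def rerank_top_candidates(
    query: str,
    merged: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 20,
) -> List[Tuple[str, float]]:
    """Re-score only the first top_n merged results with a pairwise scorer."""
    head = merged[:top_n]  # the expensive scorer never sees the rest
    scored = [(text, score_pair(query, text)) for text in head]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_overlap_scorer(query: str, candidate: str) -> float:
    """Placeholder scorer: fraction of query words present in the candidate."""
    query_words = set(query.lower().split())
    candidate_words = set(candidate.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & candidate_words) / len(query_words)


candidates = [
    "delay penalties apply per day",
    "liquidated damages cap is ten percent",
    "warranty period is two years",
]
ranked = rerank_top_candidates("liquidated damages cap", candidates, toy_overlap_scorer, top_n=2)
```

The cost of the second stage is governed by top_n alone, which is what keeps it affordable regardless of collection size.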

Why This Matters for Legal Documents Specifically

Legal documents create specific retrieval challenges that make hybrid search more valuable than it would be for general text.

Legal language uses precise defined terms consistently within a document. “The Contractor” means a specific party. “The Completion Date” means a specific date defined elsewhere. Keyword search handles these defined terms well.

But the same concept often appears differently across documents from different authors or jurisdictions. One contract says “liquidated damages.” Another says “delay penalties.” A financing agreement says “break costs.” These are economically equivalent but lexically different. A keyword search for any one will not find the others. Semantic search handles this well.

Clause references are another case where keyword search matters. “As defined in clause 8.2” or “subject to the provisions of schedule 4” are exact references that should match precisely. Semantic search might return thematically related clauses rather than the specific one referenced.

Running both approaches means neither failure mode dominates. Exact references are found by keyword search. Conceptual equivalences across documents and languages are found by semantic search. The merged result surfaces both.

What This Does Not Solve

Hybrid retrieval is better than either approach alone for finding relevant text. It does not solve the problem of understanding the structure of what was found.

A hybrid search can find the clauses most relevant to the question “what is the liquidated damages exposure if the contractor misses the mechanical completion milestone.” It cannot, on its own, understand that the answer requires combining a rate from one clause, a cap from another, and a milestone date from a third. That reasoning requires the knowledge graph, which already holds those facts in structured form.

The retrieval layer’s job in this context is to surface the verbatim source text so the answer can be cited. The graph traversal finds the structured facts. The hybrid retrieval finds the clauses those facts came from. The combination produces an answer that is both correct and traceable to its source.

Where I Am Taking This

The next article looks at what happens when retrieval fails partially: how to design a system that degrades gracefully when individual stores are unavailable, and why handling partial failure explicitly is more useful than treating it as an edge case.

Why Relationships Are First-Class Data

Abhishek Gawde — Sat, 28 Mar 2026 18:58:46 GMT

Most systems that store structured data use relational databases. Tables, rows, columns, foreign keys. The model works extremely well for a wide range of problems. It is fast, well-understood, and has decades of tooling built around it.

The model has one structural limitation that matters a great deal for document intelligence: relationships between entities are not stored. They are reconstructed.

When you want to know which obligations belong to a particular contractor, a relational database does not retrieve that connection directly. It joins two tables: an obligations table with a contractor ID column, and a contractors table with an ID column. The join reconstructs the relationship at query time by matching those IDs. The relationship itself, the fact that this contractor has these obligations on this project, does not exist anywhere in the database as a thing you can query, inspect, or attach properties to.

In a knowledge graph, the relationship is stored as a first-class object. It has its own identity, its own properties, and its own place in the data model. The connection between a contractor and an obligation is not something you reconstruct. It is something you traverse.

The same question, answered two ways. In a relational database, each connection is reconstructed at query time through joins that had to be anticipated in the schema. In a knowledge graph, each connection is traversed directly because it was stored as a first-class object. The relationship itself carries provenance, timestamps, and confidence.

What This Looks Like in Practice

Take a question that comes up on a real project: which obligations are blocking the commercial operation date, and are any of those assigned to a contractor who also has delayed permits?

To answer this, you need to follow a chain. Start with the obligations. Find the ones that block the completion milestone. For each of those, find out which contractor is responsible. Then check whether that contractor also has outstanding permit issues.

In a relational database, each of those steps is a join between tables. You join obligations to milestones to find the blocking ones. You join those results back to the contractors table to find who is responsible. You join again to the permits table to find delays. The more steps in the chain, the more joins you write, and the more joins you write, the more the query depends on having anticipated exactly this shape of question when you designed the schema.

In a knowledge graph, you follow the chain directly. An obligation has a BLOCKS relationship pointing to a milestone. It has an ASSIGNED_TO relationship pointing to a contractor. That contractor has relationships to its permits. You traverse each connection in turn. You are not reconstructing a relationship that was never stored. You are following one that was.

The practical difference is not just performance or query length. It is that a relational query has to be designed around the table structure, while a graph query can be designed around the question. When the questions are not known in advance, or when they change as the project evolves, the graph approach holds up better.

Why the Properties on Relationships Matter

Storing the relationship is one thing. The more important design choice is what you put on it.

In the system described throughout this series, every relationship between an entity and a fact carries seven fields: the source document, the document version, the clause reference, the page number, when the fact was extracted, when it became true in the real world, and the confidence score assigned during extraction.

These seven fields are not metadata attached to a node somewhere else in the graph. They are properties of the relationship itself. When you traverse from a project to its obligations and then to the parties those obligations are assigned to, you are not just getting a list of names. You are getting a chain of connections, each of which knows exactly where it came from, when it became valid, and how confident the extraction was.
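
As a sketch, a relationship carrying the seven fields might look like the record below. The class name, the field names, and the example values are all hypothetical. Only the list of seven fields comes from the description above.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Relationship:
    # Endpoints and type of the edge.
    source_id: str
    rel_type: str
    target_id: str
    # The seven provenance fields, stored on the relationship itself.
    source_document: str
    document_version: str
    clause_reference: str
    page_number: int
    extracted_at: datetime   # when the fact was extracted
    valid_from: date         # when it became true in the real world
    confidence: float


edge = Relationship(
    source_id="contract_epc",
    rel_type="CREATES_OBLIGATION",
    target_id="obligation_7",
    source_document="epc_contract.pdf",   # illustrative values only
    document_version="v2",
    clause_reference="14.3",
    page_number=41,
    extracted_at=datetime(2026, 3, 1, 9, 30),
    valid_from=date(2025, 11, 15),
    confidence=0.93,
)
```

Because the record is frozen, a changed fact cannot be edited in place. It has to be represented by a new relationship, which fits the supersede-rather-than-update approach used elsewhere in this series.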

This is what lets the system answer questions about provenance. “Where did this obligation come from?” is answered by reading the source document field on the CREATES_OBLIGATION relationship. “When did the deadline change?” is answered by looking at the SUPERSEDED_BY relationship between the old fact and the new one, and reading the valid_to field on the retired relationship.

None of this is possible if relationships are just foreign keys. A foreign key tells you that two rows are connected. It does not tell you anything about the nature of that connection.

The Document Relationship Model

One of the places where this design pays off most clearly is in how documents relate to each other.

Legal documents do not exist in isolation. An EPC contract is amended. The amendment is supplemented by schedules. The schedules reference permit conditions. The permit conditions create obligations that are tracked in a commissioning log. Each of these connections is meaningful and has legal significance.

In a document management system, these connections might exist as folder structures or metadata tags. In a relational database, they would require a separate junction table for each type of relationship. In a knowledge graph, they are modelled directly as typed relationships between document nodes.

A document SUPERSEDES another document. A document is AMENDED_BY another document. A document REFERENCES a third. A schedule is a SCHEDULES_TO attachment of a contract. Each of these relationship types carries its own meaning. You can traverse them. You can query for all documents that supersede a particular version. You can find the amendment chain for a contract. You can follow a reference from one document to the document it references and continue from there.

When a document is updated, the new version does not replace the old one. Instead, the old version gets a SUPERSEDED_BY edge pointing to the new one. Both versions remain in the graph. The default query filters for current versions. A historical query removes that filter and sees the full amendment chain.

This is not just an archiving decision. It means the system can answer: “What did the contract say before the February amendment?” The question is answered by querying the graph as it stood before the SUPERSEDED_BY edge was written, which is equivalent to filtering facts by their valid_to timestamp.

The Flat Table Problem

The instinct when coming from a relational background is to model everything as tables and reconstruct relationships as joins. This works until the relationships themselves start carrying meaning.

Consider the question: “Which obligations does this contractor have that are blocking milestones on the critical path, where those milestones are also dependencies for financing drawdowns?”

This requires traversing: obligation to party (via ASSIGNED_TO), obligation to milestone (via BLOCKS), milestone to financing condition (via IS_DEPENDENCY_FOR). Three hops, each with its own properties. In SQL, this requires three joins, possibly four if the financing condition table is separate from the drawdown table. The query is not impossible to write, but it is complex, it requires the joins to have been anticipated in the schema design, and it gets slower as the data grows because joins at this depth scan large intermediate result sets.

In a graph, this is three MATCH clauses. Each one follows a stored relationship. The traversal is bounded by the relationships that actually exist, not by the size of any table.
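
To make the three hops concrete without assuming a particular graph database, here is a small in-memory stand-in. It is not the query the system runs. The node ids are invented, and the critical-path condition from the question is left out to keep the sketch short.

```python
from collections import defaultdict

# Edges stored explicitly as (source, relationship type, target).
EDGES = [
    ("obligation_1", "ASSIGNED_TO", "contractor_x"),
    ("obligation_1", "BLOCKS", "milestone_mc"),
    ("milestone_mc", "IS_DEPENDENCY_FOR", "drawdown_2"),
    ("obligation_2", "ASSIGNED_TO", "contractor_x"),
    ("obligation_2", "BLOCKS", "milestone_fence"),
    ("obligation_3", "ASSIGNED_TO", "contractor_y"),
    ("obligation_3", "BLOCKS", "milestone_mc"),
]

# Index edges in both directions so each hop is a direct lookup.
OUT = defaultdict(list)
IN = defaultdict(list)
for src, rel, dst in EDGES:
    OUT[(src, rel)].append(dst)
    IN[(dst, rel)].append(src)


def obligations_blocking_financing(contractor):
    """Return (obligation, milestone, financing condition) triples for a contractor."""
    results = []
    for obligation in IN[(contractor, "ASSIGNED_TO")]:               # hop 1
        for milestone in OUT[(obligation, "BLOCKS")]:                # hop 2
            for condition in OUT[(milestone, "IS_DEPENDENCY_FOR")]:  # hop 3
                results.append((obligation, milestone, condition))
    return results
```

Each loop only touches edges that actually exist for the node in hand. obligation_2 drops out at the third hop because its milestone has no financing dependency, without any table being scanned.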

The tradeoff is real. For simple, predictable queries, relational databases are faster and more efficient. A flat table of obligations with a contractor ID column is faster to query for “list all obligations for contractor X” than a graph traversal. This is why the system described in this series uses a relational store for reporting and dashboards: it materialises the most common query patterns as flat rows for efficient retrieval. But the graph is the authority. The flat tables are a derived view of the graph, not a replacement for it.

What the Graph Cannot Do

Being honest about the tradeoffs matters here.

A knowledge graph is not a good place to store large amounts of unstructured text. Clause text does not belong in the graph. That is what the document store is for. A knowledge graph is not a good place to run aggregations over millions of rows. That is what the relational reporting layer is for.

The graph is the right choice when the questions require following relationships across entities, when the connections themselves carry properties that matter, and when the shape of the question cannot be fully anticipated in advance. It is not a universal replacement for relational databases. It is a complement to them, with a specific and well-defined role.

For document intelligence systems that need to answer questions about relationships between contracts, obligations, parties, milestones, and permits, and where those questions span multiple documents extracted over time, the graph is not an exotic choice. It is the minimum necessary architecture for the problem.

Where I Am Taking This

The next article looks at hybrid retrieval: why keyword search and vector similarity each fail in ways the other does not, and why running both simultaneously and combining the results consistently outperforms either approach alone.

The Party Problem: How a Knowledge Graph Figures Out That Two Names Mean One Company

Abhishek Gawde — Fri, 27 Mar 2026 18:02:10 GMT

Imagine a project with fifty documents. The EPC contract refers to the main contractor as “Siemens Energy GmbH.” A permit application mentions “Siemens Energy.” An amendment uses “SIemens AG”, a typo that appears in real documents. A technical report uses just “Siemens.”

If the system treats each of these as a separate entity, something breaks silently. The EPC contract creates an obligation assigned to “Siemens Energy GmbH.” The amendment modifies terms that apply to “SIemens AG.” But because these are different nodes in the graph, the connection is never made. The obligation floats, unlinked to the amendment. A query asking what obligations the main contractor carries returns an incomplete answer. Nobody notices until something goes wrong in the real world.

This is the entity resolution problem. It is not glamorous, but it is foundational. Every query that involves a party, every risk calculation that depends on understanding which company carries which obligations, every cross-project analysis that asks how much exposure a single counterparty represents across the portfolio: all of it depends on getting this right.

Why It Is Harder Than It Looks

The naive approach is exact string matching. If two documents use the same string, they refer to the same entity. If they use different strings, they do not.

This fails immediately in practice. Company names vary across documents for reasons that have nothing to do with intent: translated versions of a name, abbreviated forms used in operational documents, legal suffixes dropped in informal references, typographical errors that slip through review. In a corpus of several hundred documents produced over years by different people and different organisations, name variation is the norm rather than the exception.

The opposite failure is also real. Two different companies might share a common word or phrase. “German Solar GmbH” and “German Solar AG” might be the same company in different legal forms, or they might be two entirely different entities with similar names. A resolution system that is too aggressive merges things that should stay separate, which is in some ways worse than leaving duplicates, because merged nodes are harder to spot and correct.

The right approach has to handle both failure modes: catching genuine matches that vary in surface form, and not merging entities that happen to look similar but are actually different.

Three levels applied in order, cheapest first. Normalisation catches most cases at zero cost. Embedding similarity handles the remainder probabilistically. External identifiers provide the strongest signal when available, but are present in only a fraction of documents and should not be treated as a definitive override.

Three Levels, in Order

The design runs three levels of resolution, applied in increasing order of cost and reliability.

Level 1 is normalisation. Before anything is compared, every party name goes through a cleaning step: strip legal suffixes like GmbH, AG, Ltd, S.A., B.V.; lowercase everything; fix common typos; expand known abbreviations. This runs synchronously during ingestion, before any database write. It costs nothing because it is just string manipulation.

Normalisation catches a surprising proportion of cases. “Siemens Energy GmbH” and “Siemens Energy” become the same string after stripping the suffix and lowercasing. “SIemens AG” becomes “siemens” after normalisation, which is close to the normalised “siemens energy” but not identical to it, so an exact comparison leaves that case for Level 2. Not every case, but many.
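
A minimal sketch of the Level 1 cleaning step. The suffix list comes from the description above. The REPLACEMENTS table and the function name are hypothetical, and a production list of suffixes, typos, and abbreviations would be much longer.

```python
LEGAL_SUFFIXES = {"gmbh", "ag", "ltd", "s.a.", "b.v."}

# Hypothetical lookup for known typos and abbreviations.
REPLACEMENTS = {"seimens": "siemens", "intl": "international"}


def normalise_party_name(name: str) -> str:
    """Lowercase, strip trailing legal suffixes, fix known typos and abbreviations."""
    tokens = [token.strip(",") for token in name.lower().split()]
    tokens = [token for token in tokens if token]
    # Remove legal suffixes from the end of the name only.
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(REPLACEMENTS.get(token, token) for token in tokens)


examples = {
    name: normalise_party_name(name)
    for name in ["Siemens Energy GmbH", "Siemens Energy", "SIemens AG", "Siemens"]
}
```

Under these rules the first two names collapse to “siemens energy” and the last two to “siemens”. Whether those two groups should be joined is a question for the later levels, not for string cleaning.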

The reason to apply this first rather than jumping straight to something more powerful is cost and speed. Every document that arrives triggers entity resolution for every party in it. At a hundred documents a day with dozens of parties per document, the volume is high. Level 1 is free. Level 2 and Level 3 are not.

Level 2 is embedding similarity. For parties that Level 1 did not resolve definitively, the normalised name is converted into a numerical representation and compared against all existing party nodes in the graph. The comparison produces a similarity score between zero and one.

Above 0.85, the system treats the match as confirmed. The new node is merged into the existing one. All relationships that pointed to the new node are repointed to the surviving node. The duplicate is retired.

Between 0.70 and 0.85, the similarity is suggestive but not conclusive. The system flags the match as ambiguous and routes it to a human review queue, showing both nodes side by side. A reviewer decides. Until they do, both nodes stay in the graph and the new one is marked as provisional.

Below 0.70, the system treats the name as a genuinely new entity. It creates a fresh node. If this turns out to be wrong later, correction is possible but requires manual intervention.

The threshold values are not arbitrary, but they are also not permanently fixed. The right calibration depends on the specific domain and document types being processed. In practice, these numbers need to be validated against real cases where the correct answer is known.
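
The Level 2 decision can be sketched as a similarity function plus three bands. Cosine similarity is an assumption, since the text does not name the similarity measure, and it only stays between zero and one for vectors with non-negative overlap. The function names and the boundary handling at exactly 0.70 and 0.85 are also assumptions.

```python
import math

CONFIRM_THRESHOLD = 0.85  # above this: merge automatically
REVIEW_THRESHOLD = 0.70   # from here up to the confirm threshold: human review


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_match(similarity: float) -> str:
    """Map a similarity score to merge, review, or new_entity."""
    if similarity > CONFIRM_THRESHOLD:
        return "merge"
    if similarity >= REVIEW_THRESHOLD:
        return "review"
    return "new_entity"
```

Keeping the two thresholds as named constants makes the later calibration a configuration change rather than a code change.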

Level 3 is an external identifier. When a company registration number can be extracted from a document, it provides a stronger signal than name similarity alone. Two nodes with matching registration numbers are almost certainly the same legal entity, regardless of how different the names look on the surface.

The word “almost” matters. Registration numbers are not perfectly reliable as hard overrides for a few reasons.

First, they appear in only a fraction of documents. Formal construction contracts often include them, particularly in Germany where the Handelsregister number is standard in the parties section. Permit applications, technical reports, and operational correspondence often do not. In practice, registration numbers are available in perhaps a fifth to a third of cases. Level 3 fires when it can, which is less often than you might hope.

Second, they can be stale. A contract from 2019 might carry the registration number of a company that was subsequently acquired or restructured. The registration number in the old document no longer maps cleanly to the current entity. Using it as an automatic override would merge nodes that should stay separate.

Third, conflicting registration numbers are themselves a signal worth examining rather than resolving automatically. Two documents that refer to the same party name but carry different registration numbers could mean a data entry error, a post-merger transition period, or two genuinely different entities with similar names. Any of these warrants human review rather than automatic resolution.

The right framing for Level 3 is that a matching registration number raises confidence high enough to auto-accept without needing embedding comparison, and a conflicting or missing registration number routes to the same ambiguous queue as Level 2 misses. It is a strong signal, not a definitive override.

What Happens When Two Nodes Merge

When the system confirms that two nodes represent the same entity, a merge operation runs.

All relationships pointing to the retiring node are repointed to the surviving node. If the retiring node had an obligation assigned to it, that obligation now belongs to the surviving node. If it was a signatory to a contract, that contract is now linked to the surviving node. Nothing is lost; everything is reattached.

The retiring node is not deleted. It stays in the graph with a status of MERGED and a pointer to the node it was merged into. This matters for two reasons.

First, audit trail. If someone later asks why an obligation is assigned to a particular party, the answer might trace back through a merge. The original extraction said “SIemens AG.” That node was merged into the canonical “Siemens Energy GmbH” node. The history is legible, not obscured.

Second, recovery. Merges can be wrong. If a reviewer later determines that two nodes were incorrectly merged, the merge can be reversed. The retired node still exists. Its original relationships can be restored. If the node had been hard deleted, recovery would require re-extracting from the source documents.
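
A merge that retires rather than deletes can be sketched in a few lines. Everything here is illustrative: the class names, the ACTIVE status, and the merge_log are hypothetical, and the sketch only handles edges that point at the party node. Tracking repointed edges by list index is a simplification that would not survive edge deletions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PartyNode:
    node_id: str
    name: str
    status: str = "ACTIVE"           # hypothetical name for the stable status
    merged_into: Optional[str] = None


@dataclass
class Graph:
    nodes: Dict[str, PartyNode] = field(default_factory=dict)
    # Each edge is (source id, relationship type, party node id).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)
    # Retired node id -> indexes of the edges that were repointed.
    merge_log: Dict[str, List[int]] = field(default_factory=dict)

    def merge(self, retiring_id: str, surviving_id: str) -> None:
        moved = []
        for i, (src, rel, dst) in enumerate(self.edges):
            if dst == retiring_id:
                self.edges[i] = (src, rel, surviving_id)
                moved.append(i)
        retiring = self.nodes[retiring_id]
        retiring.status = "MERGED"          # retired, not deleted
        retiring.merged_into = surviving_id
        self.merge_log[retiring_id] = moved

    def unmerge(self, retired_id: str) -> None:
        # Restore the edges that were moved, then reactivate the node.
        for i in self.merge_log.pop(retired_id):
            src, rel, _ = self.edges[i]
            self.edges[i] = (src, rel, retired_id)
        node = self.nodes[retired_id]
        node.status = "ACTIVE"
        node.merged_into = None
```

The merge_log is what makes reversal cheap: without a record of which edges moved, an unmerge could not tell the retired node's relationships from the survivor's own.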

The Over-Merging Problem

There is a failure mode in entity resolution that is less obvious than missing a match but equally damaging: merging things that should stay separate.

The specific risk is transitive closure. If the system determines that node A matches node B, and separately that node B matches node C, it might conclude that A and C should also be merged. Sometimes this is correct. Often it is not. B might be an ambiguous entity that superficially resembles both A and C, while A and C are genuinely distinct.

Cascading merges that follow transitive links without checking the direct A-to-C relationship can collapse distinct legal entities into one. A large contractor with a common word in its name might end up merged with a subsidiary, a competitor, or a completely unrelated company that happens to share part of the name.

The safeguard is to treat each merge decision as independent. A matching B and B matching C does not automatically mean that A matches C. Each pair needs its own comparison. Where the direct comparison is ambiguous, the merge does not happen automatically and the pair is routed to human review instead.
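
The pairwise rule is easy to state in code. The scores below are invented to show the failure case, and the function names are hypothetical. The 0.85 threshold is the Level 2 confirmation value from earlier in the article.

```python
from itertools import combinations

# Invented similarity scores: A-B and B-C look like matches, A-C does not.
PAIR_SCORES = {
    frozenset({"A", "B"}): 0.90,
    frozenset({"B", "C"}): 0.88,
    frozenset({"A", "C"}): 0.75,
}


def similarity(a, b):
    return PAIR_SCORES[frozenset({a, b})]


def can_merge_all(names, similarity, threshold=0.85):
    """True only if every pair in the group clears the threshold directly."""
    return all(similarity(a, b) > threshold for a, b in combinations(names, 2))


def pairs_blocking_merge(names, similarity, threshold=0.85):
    """Pairs whose direct comparison does not clear the threshold."""
    return [(a, b) for a, b in combinations(names, 2) if similarity(a, b) <= threshold]
```

Following transitive links would collapse A, B, and C into one node. The direct check refuses, and the A-to-C pair is the one that goes to a reviewer.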

This also argues for a conservative confidence threshold rather than an aggressive one. Merging incorrectly is harder to spot and harder to correct than leaving a legitimate duplicate unmerged. A duplicate shows up as two separate nodes and is annoying. An incorrect merge shows up as a single node with contradictory or inflated data, and is much harder to detect.

When a party name is first extracted from a document, before Level 2 resolution has run, the node is created as PROVISIONAL. This is an explicit signal that the entity has not yet been confirmed.

PROVISIONAL nodes participate in the graph. They can have obligations assigned to them. They appear in query results. But they are flagged as unresolved, which means they surface in the human review interface and any answer that references them carries a caveat that the party has not been fully verified.

When Level 2 confirms a match, the PROVISIONAL node is merged and retired. When Level 2 determines the party is genuinely new, the PROVISIONAL flag is upgraded to a stable status. When Level 2 finds the match ambiguous, the PROVISIONAL status stays until a human decides.

The reason for this two-stage approach is that the alternative is worse. Waiting for resolution before creating any node at all means the graph is incomplete while resolution is pending. An obligation extracted from an EPC contract cannot be written until the contractor party is resolved, which could take hours if the resolution queue is backed up. With PROVISIONAL nodes, the extraction proceeds immediately. The resolution happens asynchronously. The graph is always current on what has been extracted, even if some party assignments are pending confirmation.

What This Changes at Portfolio Scale

The entity resolution problem looks manageable at single-project scale. Fifty documents, a few dozen parties, the occasional duplicate. The work of resolution is real but finite.

At portfolio scale it compounds. The same contractor appears in twenty projects. Their name varies slightly across projects because different document authors used different forms. Without cross-project entity resolution, the portfolio view shows twenty separate nodes for what is actually one counterparty. A risk calculation that asks “what is our total obligation exposure to this contractor across all projects?” is working from twenty different numbers rather than one.

Getting entity resolution right at the project level is a prerequisite for getting portfolio analytics right. The graph algorithms that surface cross-project patterns, the risk scoring that aggregates exposure across the portfolio, the community detection that identifies contractor dependency clusters: all of these depend on a graph where the same real-world entity is one node, not many.

Where the Design Is Still Imperfect

The embedding similarity threshold is the most obvious open question. The confirmation threshold of 0.85 is a starting point based on the domain, not a calibrated value. Some legitimate matches will fall below it and route to human review unnecessarily. Some incorrect matches might exceed it for parties in similar industries with overlapping names. Finding the right threshold requires running the system against a labelled dataset of known matches and non-matches. That calibration has not been done yet, and it requires real operating data rather than upfront design.
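For a sense of what that calibration might look like, here is a sketch that sweeps candidate thresholds over labelled pairs and reports precision and recall. The labelled data is made up for illustration; nothing here reflects measured results.

```python
def precision_recall(
    pairs: list[tuple[float, bool]], threshold: float
) -> tuple[float, float]:
    """pairs holds (similarity score, is_true_match) for labelled party pairs."""
    tp = sum(1 for score, match in pairs if score >= threshold and match)
    fp = sum(1 for score, match in pairs if score >= threshold and not match)
    fn = sum(1 for score, match in pairs if score < threshold and match)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up labelled pairs, for illustration only.
labelled = [(0.95, True), (0.88, True), (0.86, False), (0.80, True), (0.60, False)]

for threshold in (0.80, 0.85, 0.90):
    p, r = precision_recall(labelled, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Because an incorrect merge is costlier than a missed one, a sweep like this would normally be read with precision weighted more heavily than recall.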

The harder problem is corporate structure. A parent company and its subsidiary might share a name root and appear in different documents with different roles. An acquired company might appear in old contracts under its pre-acquisition name and in new documents under its new name. A joint venture might be referred to by its parent names in some documents and its own legal name in others. None of the three resolution levels handles these cases cleanly. These cases all route to human review, which is honest but means the volume of human decisions grows in proportion to corporate complexity in the project portfolio.

The global Legal Entity Identifier, or LEI, is in principle the right answer for cross-border entity disambiguation. It is a standardised identifier maintained by a global registry with good coverage of financial market participants. In practice, smaller construction contractors and landowners often do not have LEIs, and the coverage in the infrastructure project domain is patchy enough that it cannot be relied on as a primary resolution mechanism. It is worth checking for when available but not worth designing the resolution architecture around.

Where I Am Taking This

The next article looks at why relationships in a knowledge graph are fundamentally different from foreign keys in a relational database, and what that difference makes possible for the kinds of questions document intelligence systems need to answer.

From Answering Questions to Acting on Them

Abhishek Gawde — Thu, 26 Mar 2026 18:01:49 GMT

The previous articles in this series have been about building knowledge and making it queryable. Documents go in, facts come out, questions get answered. The loop closes when a user types a question and reads the response.

That loop has a gap in it.

A project manager working on a live infrastructure project has obligations with deadlines approaching. Liquidated damages accruing on delayed milestones. Permit conditions expiring. Counterparties whose financial standing may have changed since the contract was signed. None of these become visible through the query layer unless someone thinks to ask about them. If nobody asks, nothing surfaces.

The knowledge graph has all of this information. What it lacks, in a pure query architecture, is initiative.

The Distinction That Matters

The query layer answers a question when asked. It is reactive. A user types “what obligations are approaching their deadline?” and receives a cited answer.

An agent is assigned work. It runs to completion on its own, consults the knowledge graph, does additional research if needed, and produces a deliverable: a report, a spreadsheet, an alert, a task. It does not wait for the question. It runs at 6am, finds what it needs to find, and has already posted a digest to the project channel before anyone arrives at their desk.

This distinction matters beyond the obvious. The query layer is bounded by what users know to ask. An agent operates on what the system knows should be checked. Those are not the same set of things.

What Agents Are and Are Not

Before going further it is worth being precise about what an agent means in this context, because the term covers a wide range of designs.

In this system, an agent is a structured workflow that runs a defined set of tool calls in a sequence determined by a planning step. The planning step is an LLM call that receives the current state of the knowledge graph, the agent’s brief, and a scoped set of tools it is allowed to use. It returns an ordered plan. The plan executes. Outputs are delivered.

This is not a fully autonomous agent that decides its own goals. The goals are declared in a registry. The scope is bounded. The tools are explicitly listed. The cost is capped per run. An agent that hits its cost ceiling mid-plan halts immediately and surfaces an alert. No partial output is delivered.
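A rough sketch of the cost cap behaviour, under the assumption that each step carries a cost estimate that can be checked before it runs. The Step type and the exception name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

class CostCapExceeded(Exception):
    """Raised when a run would exceed its declared cost ceiling."""

@dataclass
class Step:
    name: str
    cost: float             # estimated cost of this tool call, in dollars
    run: Callable[[], str]  # hypothetical tool call returning an output

def execute_plan(plan: list[Step], cost_cap: float) -> list[str]:
    """Run steps in order and return outputs only if the whole plan fits the cap."""
    spent = 0.0
    outputs: list[str] = []
    for step in plan:
        if spent + step.cost > cost_cap:
            # Visible failure: raise rather than return the partial outputs.
            raise CostCapExceeded(f"halted before '{step.name}' at ${spent:.2f}")
        outputs.append(step.run())
        spent += step.cost
    return outputs
```

Raising an exception rather than returning what was collected so far is the point: the caller gets either a complete result or an alert, never something in between.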

The reason for these constraints is the same reason the extraction pipeline uses confidence triage and a human review queue: autonomy without guardrails produces outputs that look complete but may not be. In a system where the outputs have financial and legal implications, the design principle is to prefer visible failure over silent approximation.

The Knowledge Graph as Agent Memory

What makes an agent in this system different from a general-purpose LLM agent is what it reads before it plans.

Before the planning step runs, the agent loads the current state of the knowledge graph scoped to its brief. An obligation monitoring agent loads the open obligations for the project, their deadlines, the parties assigned to them, the liquidated damages rates on the relevant contracts, and the outcomes of the last few runs of the same agent. The LLM planning step receives all of this as structured context. It is not reasoning from training data or from raw document text. It is reasoning from extracted, provenanced, current-status facts.

This matters for reliability. An LLM reasoning from raw documents might hallucinate an obligation that does not exist, or miss one that is buried in a schedule. An LLM reasoning from a knowledge graph that has already extracted, validated, and confidence-triaged those obligations is working from a curated fact set. The failure modes are different and considerably narrower.

Three Trigger Modes

Agents run on a schedule, in response to graph events, or on demand.

Scheduled agents handle monitoring tasks. They do not need a reason to run. Their job is to check what needs checking, on a cadence defined in the agent registry.

Event-driven agents respond to changes in the graph. When a new party node is created because a contract was processed, an agent can trigger to research that party’s financial standing. When the materialisation pipeline detects that a milestone’s forecast date has moved beyond its guaranteed date, an agent can trigger to pull the relevant LD clause, compute current accrual, and send an escalation. The event is the signal. The agent handles the response.

On-demand agents run when explicitly requested: a one-off counterparty analysis, a re-run after a data correction, a test. The same execution shell handles all three modes. The trigger mechanism is the only thing that differs.

Every agent runs the same seven steps regardless of how it was triggered. The planning step generates the tool-call sequence from the current graph context. If the cost cap is hit at any point, the agent halts immediately and delivers nothing.

Three Agents in Practice

The design is easier to understand with concrete examples. The following three agents illustrate the range of what this layer can do.

Obligation Deadline Monitor

This agent runs every weekday morning. Its brief is simple: find open obligations due within 30 days, and anything already overdue.

The context load pulls all open obligations for the project from the graph, enriching each with the liquidated damages rate and cap from the relevant contract. The planning step then scores each obligation by financial exposure: the LD rate multiplied by days at risk, capped at the contract ceiling. The top-risk items get their verbatim clause text retrieved from the document store so the digest can cite the source.
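The scoring arithmetic is simple enough to show directly. This sketch uses invented figures and field names; only the formula (rate times days at risk, capped at the contract ceiling) comes from the description above.

```python
def ld_exposure(rate_per_day: float, days_at_risk: int, cap: float) -> float:
    """Exposure is the daily LD rate times days at risk, limited to the cap."""
    return min(rate_per_day * max(days_at_risk, 0), cap)

# Illustrative figures only; not taken from any real contract.
obligations = [
    {"id": "OBL-1", "rate_per_day": 10_000.0, "days_at_risk": 12, "cap": 500_000.0},
    {"id": "OBL-2", "rate_per_day": 50_000.0, "days_at_risk": 20, "cap": 750_000.0},
    {"id": "OBL-3", "rate_per_day": 2_000.0, "days_at_risk": 5, "cap": 100_000.0},
]

# Highest exposure first, as the digest would list them.
ranked = sorted(
    obligations,
    key=lambda o: ld_exposure(o["rate_per_day"], o["days_at_risk"], o["cap"]),
    reverse=True,
)
```

OBL-2 would accrue 1,000,000 uncapped, but the cap holds it at 750,000, which still ranks it first.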

The output is a structured digest: a ranked list of obligations by financial exposure, with deadline, assigned party, LD exposure per day, clause citation, and a confidence flag on any fact that has not yet been human-verified. It goes to a SharePoint folder and a Teams post in the project channel.

What the agent cannot do is invent urgency that is not in the graph. If an obligation deadline is marked as unverified, the digest reflects that. If no obligations are approaching, the agent records a clean run and posts nothing. The output reflects the state of the knowledge graph, not the agent’s interpretation of it.

Estimated cost: roughly $0.08-0.15 per project per run.

Milestone Tracker and LD Accrual Monitor

This agent also runs on a daily schedule, but it has two execution paths depending on what it finds.

On a clean run, it queries all guaranteed milestone dates, checks their current status, reads the LD rate and cap for each milestone from the relational store, computes accrual, and produces a daily tracker spreadsheet. If cap utilisation is below warning thresholds, it posts a routine status update to the construction management channel and stops there.

On an escalation run, something has changed. A milestone has crossed a cap warning threshold, or a new delay has been detected. At this point the agent adds two steps: a web search for force majeure evidence in the project’s jurisdiction for the current month, and a retrieval of the verbatim LD clause and force majeure clause from the contract. The spreadsheet gains an additional tab listing extension of time candidates. An email goes to the construction director and commercial manager.
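The choice between the two paths can be sketched as a threshold check on cap utilisation. The 0.5 warning level and the function names are assumptions made for the example; the actual thresholds are not stated here.

```python
from datetime import date

def accrued_ld(guaranteed: date, forecast: date, rate_per_day: float, cap: float) -> float:
    """LD accrued for a milestone forecast to land after its guaranteed date."""
    days_late = max((forecast - guaranteed).days, 0)
    return min(days_late * rate_per_day, cap)

def choose_path(accrued: float, cap: float, new_delay: bool, warn_at: float = 0.5) -> str:
    """Return 'escalation' or 'clean' for one milestone.

    warn_at is a hypothetical warning threshold, as a fraction of the cap.
    """
    utilisation = accrued / cap if cap > 0 else 0.0
    if new_delay or utilisation >= warn_at:
        return "escalation"
    return "clean"
```

For example, a milestone 40 days late at 20,000 per day against a 1,000,000 cap sits at 80% utilisation and would take the escalation path under these assumed settings.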

The escalation path is not triggered by a human noticing something. The agent checks the thresholds on every run. If the numbers cross a line, the escalation happens automatically. The value is not that the agent knows something the team does not know. It is that the agent checks every day, and the team gets the escalation the morning it becomes relevant rather than when someone remembers to look.

Estimated cost: $0.06-0.12 on a clean run, $0.15-0.25 on an escalation run.

Counterparty Health Monitor

This agent is event-driven rather than scheduled. It triggers when a new party node is written to the knowledge graph, typically when a new contract is processed by the extraction pipeline.

Its brief is to assess whether that party represents a concentration risk. The context load pulls all obligations and contract roles attributed to that party across every project in the graph. Total LD exposure, critical path obligations, roles across multiple projects: all of this is already in the graph because the extraction pipeline has been running across the whole portfolio.

The agent then researches the party’s public financial standing: checks for insolvency signals, recent news about construction delays, whether annual accounts filings are current. It combines the internal exposure picture with the external signals into a risk score.

The routing is determined by the score rather than by a human decision. Below 0.3, the brief goes to SharePoint only. Between 0.3 and 0.6, a Teams notification goes to the project team. Above 0.6, the output is flagged for legal review and an email goes out.
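As a sketch, the routing is a plain threshold function. How the exact boundary values 0.3 and 0.6 are treated is not specified above, so the choice below is an assumption, as are the channel labels.

```python
def route(score: float) -> list[str]:
    """Map a risk score in [0, 1] to the actions described for its tier.

    Assumption: a score of exactly 0.3 falls in the middle tier and exactly
    0.6 in the top tier. Whether higher tiers also include the lower-tier
    actions is not stated, so each tier returns only its own.
    """
    if score < 0.3:
        return ["sharepoint"]
    if score < 0.6:
        return ["teams"]
    return ["legal_review", "email"]
```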

The risk score itself is written back to the party node in the knowledge graph as a flagged, unverified value, with full agent provenance. A human reviewer can confirm or override it. The agent’s assessment is an input to the review, not a final verdict.

Estimated cost: $0.15-0.30 per run, variable by the number of web searches required.

Three agents, two trigger modes. DEV-01 monitors obligations on a weekday schedule. EPC-01 runs daily with a conditional escalation path when cap thresholds are crossed. DEV-05 fires on a graph event when a new party is created.

What the Agent Cannot Do

The boundaries matter as much as the capabilities.

An agent cannot trigger the ingestion or extraction pipelines. If a new document appears in SharePoint, an agent does not ingest it. That is the ingestion pipeline’s job, triggered by its own scan cycle. An agent reads from the knowledge graph. It does not build it.

An agent cannot write authoritative facts to the knowledge graph. Any value an agent writes (a risk score, a computed accrual, a flagged anomaly) appears in the human review queue exactly as a flagged extraction from the pipeline does. An agent write is never authoritative on first write. It is a flagged value awaiting human confirmation.

An agent cannot decide to expand its own scope. The context query is declared in the registry. The tool set is scoped per agent. An obligation monitoring agent does not decide mid-run to start researching counterparty financial standing. That is a different agent with a different brief.

These constraints exist because the value of the system comes from predictability. An agent that behaves consistently, produces outputs at known cost, and fails visibly when it cannot complete is more useful in practice than one that is more capable but less predictable.

What This Changes

The shift from a pure query system to one with an agentic layer changes what the system is for.

A query system requires users to know what to ask and remember to ask it. In a project environment with hundreds of documents and dozens of obligations, that is a meaningful limitation. Important things surface only when someone thinks to look for them.

An agent-augmented system transfers that responsibility to the platform. The obligation monitoring agent knows that open obligations with approaching deadlines should be surfaced every weekday morning. It does not wait for a user to remember to ask. The milestone tracker knows that liquidated damages accrual should be computed and posted daily. The counterparty monitor knows that a new party node is an event worth acting on.

The knowledge graph, built through the extraction and retrieval architecture described in the previous ten articles, becomes the foundation for something that resembles an attentive colleague rather than a searchable database. One that checks the things worth checking, surfaces what matters before it becomes urgent, and leaves a full audit trail of everything it did and why.

Where I Am Taking This

The next two articles look at what becomes possible when project graphs are connected at portfolio scale: the graph algorithms that surface patterns no individual query can find.

What It Actually Takes to Make Documents Answerable

Abhishek Gawde — Thu, 19 Mar 2026 12:49:39 GMT

Nine articles in, the individual pieces are clear. This one steps back and looks at what the design adds up to: the problem, the architecture, and why each part of it exists.

A project manager working on a live infrastructure project has a specific problem. The information they need is written down. Obligations, deadlines, liquidated damages rates, permit conditions, payment milestones: all of it exists, in contracts and permits and financing agreements sitting in a document management system. But there is no way to ask a question and get an answer. The workflow is manual. Open the document, search, read, repeat. Hours per query. Incomplete results. Missed things.

The consequences of missing obligations in this context are not minor. Liquidated damages accrue silently. Permit conditions expire without action. Lender covenants get breached. The cost is financial and legal, and it compounds the longer the gap goes undetected.

This series has been about designing a system to close that gap. Not a general-purpose document search tool, but a system that extracts structured knowledge from project documents, models the relationships between facts across documents, and answers questions in natural language with citations back to the source clauses.

The previous nine articles covered each design decision individually. This one looks at what they add up to.

Five pipelines, one build/consume boundary. P1 through P3 build knowledge into the graph. P4 and P5 consume it. Nothing in the consume layer writes to the primary store.

Why Standard Approaches Do Not Reach Far Enough

The first thing to understand is why the obvious solutions fall short.

Document management systems provide storage and keyword search. They can find a document that contains a phrase. They cannot extract the facts inside that document, model them, or answer a question that spans multiple documents.

General-purpose retrieval-augmented generation systems get further. They embed documents into a vector index and retrieve the chunks most similar to a question. For questions like “what does clause 14.3 say,” this works well. For questions like “what is the financial exposure if a contractor misses the mechanical completion milestone,” it breaks down. That question requires identifying the obligation assigned to the contractor, finding the liquidated damages rate on the contract, establishing that the milestone is on the critical path, and connecting those facts across documents. That is a graph traversal. Similarity search has no concept of a graph traversal.

The gap is not retrieval quality. It is retrieval structure. The question requires following relationships between entities, and a flat vector index has no way to represent or traverse those relationships.

Three stores, each with a distinct role. Neo4j is the system of record. AI Search holds the verbatim evidence. SQL is a derived projection for reporting efficiency. If Neo4j and SQL conflict, Neo4j wins.

The Architecture in Full

The system is built on three specialised stores and five processing pipelines. Each has a defined role. None does more than its role.

The processing runs in sequence. Documents are fetched from SharePoint and ingested: parsed, classified by document type, chunked at clause boundaries, embedded, and written to the document store. The ingestion step also writes a skeleton of each document to the knowledge graph: the document node, the parties identified from the text, and the relationships between documents that reference each other.

Once a document is ingested, extraction runs. A set of focused jobs, each scoped to specific document types, reads the clause chunks and extracts structured facts: obligations, commercial terms, permit conditions, milestones, party roles. Each job is narrow by design. The obligation extraction job runs against legal and commercial documents only. It does not run against engineering drawings, where language that looks legally significant is not. The scoping is structural, not a prompt engineering choice.

Extracted facts write into the knowledge graph with full provenance: which document, which version, which clause, which page, when extracted, what confidence the model assigned to each field. Nothing is written without provenance. The reason is specific: without it, document versioning is impossible. When an amendment arrives six months later and changes a rate, the system needs to know what the old rate was, when it was superseded, and which document introduced the change.

A materialisation step projects the graph into flat SQL tables for reporting. This is a derived copy, not an authoritative one. If the graph and the SQL tables disagree, the graph wins. The SQL tables exist because flat row lookups are more efficient for dashboards and reporting queries than graph traversals.

The query layer accepts natural language questions and produces cited answers. The routing step embeds each question and compares it against a library of known question templates. Questions similar enough to a known template take the fast path: fill in the parameters and execute the pre-written graph query. Novel questions take the full path: a language model writes a graph query for this specific question, executes it, and if successful, adds the new template to the library for future use.

The retrieval that follows uses two stores simultaneously. The knowledge graph traversal returns the structured facts: which obligations exist, who they are assigned to, what the deadline is, what the financial consequence is. The document store retrieval returns the verbatim clause text that each fact was extracted from. Both are assembled into a single package before the language model generates the answer. The model receives the structured facts and the source clauses. Its role is to express the answer in clear language and attach the citations. It does not query the stores directly. It does not reason over raw documents. It receives a prepared package and produces the response.

The Design Decisions and What Each Was Responding To

Looking back, the architecture is a series of responses to specific failure modes rather than a set of aspirational choices.

The graph is the primary store because multi-hop questions are real. A question about which obligations are blocking a milestone, assigned to a party that also has delayed permits, requires following three relationship hops across facts that came from different documents. A flat index cannot represent this. A graph can. The graph was not chosen because it is technically interesting; it was chosen because the question set that matters requires it.

Extraction is job-oriented because single-pass extraction produces noise. Running one large extraction pass over every document produces false positives from documents where legal-sounding language has no legal meaning, misattributed obligations because party identification has not run yet, and facts extracted without the context to validate them. Focused jobs, each asking one question of a defined set of document types, and each running in dependency order, resolve all three.

Provenance is on every fact because history matters as much as current state. A system that stores only current values cannot answer “what did we believe about this obligation before the amendment” or “when did the LD rate change.” The seven-field provenance model, with system time and valid time tracked separately, makes both kinds of question answerable. This is not metadata. It is the mechanism that makes the system trustworthy in a legal context.
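To illustrate why two clocks are needed, here is a sketch with only two of the provenance fields. The field names and the RateFact type are hypothetical; the full seven-field model is not reproduced.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class RateFact:
    rate: float
    valid_from: date   # valid time: when the rate takes effect under the contract
    recorded_at: date  # system time: when the system extracted it

def rate_as_known(
    facts: list[RateFact], known_on: date, effective_on: date
) -> Optional[float]:
    """The rate effective on `effective_on`, using only facts recorded by `known_on`."""
    visible = [
        f for f in facts
        if f.recorded_at <= known_on and f.valid_from <= effective_on
    ]
    if not visible:
        return None
    # Latest effective date wins; ties go to the most recently recorded fact.
    return max(visible, key=lambda f: (f.valid_from, f.recorded_at)).rate
```

Varying `known_on` answers "what did we believe at the time", while varying `effective_on` answers "what rate applied on that date". A single timestamp cannot separate the two.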

The ontology is the single source of truth because the alternative does not scale. A system where document type definitions, extraction schemas, database constraints, and prompt instructions are maintained in separate places accumulates inconsistencies as it grows. Adding a new document type requires finding every place that definition lives and updating them in sync. The ontology-driven approach encodes the entire domain model in one file and generates all downstream artefacts from it. One edit propagates everywhere. The discipline is harder to establish at the start and pays back continuously after.

Confidence triage exists because wrong facts are not inert. A fact written to the graph with low confidence does not sit quietly. It is traversed by every subsequent query. An obligation assigned to the wrong party corrupts every answer to questions about that party’s exposure. The three-bucket routing (auto-accept above 0.90, flagged between 0.70 and 0.90, quarantined below 0.70) keeps known-unreliable data out of the live graph while keeping the error surface visible rather than hidden.
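A minimal sketch of the three buckets. The thresholds are the ones stated above; how a score of exactly 0.90 or 0.70 is treated is an assumption, as are the bucket labels.

```python
from collections import defaultdict

def bucket(confidence: float) -> str:
    # Assumption: boundary values fall into the higher bucket.
    if confidence >= 0.90:
        return "auto_accept"
    if confidence >= 0.70:
        return "flagged"
    return "quarantined"

def triage(facts: list[dict]) -> dict[str, list[str]]:
    """Group fact ids by bucket so each group can be routed separately."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for fact in facts:
        buckets[bucket(fact["confidence"])].append(fact["id"])
    return dict(buckets)
```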

The question library is not a cache because cached answers go stale. The library stores query templates, not answers. A template matched to a question executes fresh against the live graph on every call. The answer reflects the current state of the knowledge base, not the state it was in when the template was written. The fast path saves the cost of generating a new query. It does not sacrifice freshness.

What the System Cannot Do

It is worth being explicit about the boundaries.

The system extracts what the ontology defines. Documents outside the defined taxonomy are ingested and searchable but do not have structured facts extracted from them. Expanding the coverage means expanding the ontology, not rebuilding the pipeline.

Extracted facts carry confidence scores and provenance, but they do not carry legal weight without human verification. The system surfaces information. It does not certify it. A fact with a confidence of 0.95 is still a machine extraction. It is useful. It is not a substitute for human review in a context where the decision has legal consequence.

The system also does not know what it has not yet seen. A question answered on day ten of a project, when only twenty documents have been processed, reflects those twenty documents. It does not reflect the thirty still in queue. The processing status is tracked and can be surfaced alongside the answer, but the incompleteness is real and the system cannot reason over documents it has not yet extracted.

What the Design Adds Up To

The series started with a simple observation: there is a class of question about project documents that no standard retrieval approach can answer. The question requires structured facts, modelled relationships, provenance, and citation. Getting all four at once requires more architecture than most document intelligence systems carry.

The nine decisions documented in the previous articles (chunking at clause boundaries, job-oriented extraction, confidence triage, two-clock provenance, ontology-driven design, two-tier query routing, graph-first retrieval, evidence grounding, continuous evaluation) are each a response to a specific failure mode that appears if you do not make that choice. None of them are ornamental.

The result is a system that can answer “which obligations are approaching their deadline, who is responsible, and what is the financial exposure if they are missed” from a corpus of several hundred documents, in seconds, with citations. That question was previously a manual search task measured in hours.

Whether this particular set of choices is the right one for every similar problem is a different question. The document types, the confidence thresholds, the routing parameters, the specific stores: all of these are tunable. What the architecture establishes is the structure within which those choices can be made and revised.

The File That Runs the System

Abhishek Gawde — Wed, 18 Mar 2026 16:37:38 GMT

Most document intelligence systems scatter their domain knowledge across schemas, prompts, database definitions, and application code. When something changes, the work is finding everywhere it needs to change. There is a better structure.

A single YAML file defines the entire domain model. A code generation step produces eight downstream artefacts from it. One edit propagates everywhere. Nothing drifts.

Every system that extracts structured knowledge from documents has to answer the same set of questions. What document types exist? What facts are worth extracting from each one? What entities and relationships go into the database? What does the LLM need to be told to find the right things? What does the database need to enforce to prevent bad data from getting in?

The naive approach is to answer these questions in place: write the extraction schema where the extraction happens, write the database constraint where the database is defined, write the prompt where the LLM is called. This works right up to the moment something needs to change. A new document type. A new field on an existing entity. A renamed relationship. Each change requires finding every place that definition lives and updating them in sync. The more the system grows, the more places there are to update, and the more likely it is that one of them gets missed.

The design that holds the whole system together is simpler: answer all of those questions in one place, and generate everything else from it.

One File

The entire domain model lives in a single YAML file. It defines what document types the system recognises, what entities and relationships can exist in the knowledge graph, what each extraction job is looking for and in which documents, and what happens when two documents contradict each other on the same fact.

A code generation step reads this file and produces all the downstream artefacts from it: the validation schemas that check every extraction result before it touches the database, the database constraints that enforce the graph structure, the write patterns for creating and updating entities, the table definitions for the relational store, the prompt fragments that tell the LLM what to look for in each document type, and the API documentation.

None of those artefacts are written by hand. They are outputs. The YAML file is the only input.

The consequence is specific and important: there is one place where the definition of an Obligation lives. One place where the list of document types that contain obligations is declared. One place where the fields that must be present on every extracted fact are specified. When any of those change, the change happens once and propagates everywhere.

What the File Actually Contains

It is worth being concrete about what this file holds, because the scope is wider than it might initially seem.

The document taxonomy. Every document type the system can recognise and process, organised into categories. An EPC contract and a bird migration assessment are different document types. A permit and a financing agreement are different document types. The classification step at ingestion uses this taxonomy. The extraction jobs use it to know which documents they should and should not run against.

The entity and relationship model. Every node type that can exist in the graph, with its properties and which ones are required. Every relationship type, with its valid source and target node types. A relationship that connects the wrong kinds of nodes fails at the database constraint level, not silently at query time.

The extraction job registry. Every extraction job is declared here: which document types it applies to, which entities and relationships it produces, which trigger words to search for in the document chunks, how many chunks to retrieve, and which other jobs it depends on. The obligation extraction job applies to EPC contracts, permits, financing agreements, and a handful of other legal document types. It does not apply to engineering drawings or photo logs. That scoping is declared here, not hardcoded in the extraction pipeline.

Conflict resolution rules. When two documents contain contradictory facts about the same entity, which one takes precedence? An amendment supersedes the base contract. A signed version supersedes an unsigned draft. These rules are declared explicitly. When the extraction pipeline encounters a conflict, it reads the rules and resolves it. Without explicit rules, conflicting facts either silently overwrite each other or both survive with no indication of which is current.
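The two precedence rules named above can be sketched as a comparison function. The Fact type is hypothetical, and the relative order of the two rules (signature status checked before document kind) is an assumption, since the article does not say which rule wins when they disagree.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    value: float
    doc_kind: str  # "base_contract" or "amendment"
    signed: bool

def precedence(fact: Fact) -> tuple[int, int]:
    """Higher tuple wins: signed beats unsigned, then amendment beats base."""
    return (int(fact.signed), int(fact.doc_kind == "amendment"))

def resolve(a: Fact, b: Fact) -> Optional[Fact]:
    """Return the fact that takes precedence, or None if the rules cannot decide."""
    pa, pb = precedence(a), precedence(b)
    if pa == pb:
        return None  # no rule applies: surface the conflict instead of overwriting
    return a if pa > pb else b
```

Returning None when no rule applies keeps the conflict visible, which is the alternative to the silent overwrite described above.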

Why Scoping Matters More Than It Looks

The extracts_from field on each extraction job deserves specific attention because it is easy to underestimate.

Engineering drawings contain phrases like “shall be installed at a specified height.” The word “shall” appears. Without scoping, an obligation extraction job would run against that drawing and flag it as a contractual obligation. It is not. It is a construction specification. It creates no legal duty, has no responsible party in a contractual sense, and has no financial consequence if missed.

The scoping filter prevents this. The obligation extraction job is declared to run against legal and commercial document types only. When a drawing arrives, the job does not run. No false positive enters the system. No prompt engineering is required to teach the LLM to distinguish between these cases. The distinction is structural, declared once in the ontology, and enforced by the job registry before the LLM is ever called.
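A sketch of what that filter could look like, assuming the registry is a list of job declarations. Only the extracts_from field name comes from the description above. The class ExtractionJob, its other fields, the document type identifiers, and the function jobs_for are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionJob:
    name: str
    extracts_from: frozenset[str]          # document types the job may run against
    trigger_words: tuple[str, ...] = ()    # hypothetical field
    top_k: int = 5                         # hypothetical: chunks to retrieve
    depends_on: tuple[str, ...] = ()       # hypothetical: names of prerequisite jobs


OBLIGATION_JOB = ExtractionJob(
    name="obligation_extraction",
    extracts_from=frozenset({"epc_contract", "permit", "financing_agreement"}),
    trigger_words=("shall", "must"),
)


def jobs_for(document_type: str, registry: list[ExtractionJob]) -> list[ExtractionJob]:
    """Return only the jobs scoped to this document type.

    Runs before any model call, so an out-of-scope document never reaches a prompt.
    """
    return [job for job in registry if document_type in job.extracts_from]
```

With this registry, jobs_for("engineering_drawing", [OBLIGATION_JOB]) returns an empty list, so the drawing's "shall" is never shown to the obligation prompt at all.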

This is the difference between solving a problem structurally and solving it with better prompts. A prompt can be improved, but it can also regress when the wording or the model changes. A structural constraint holds until someone deliberately edits the declaration.

What Code Generation Produces

A concrete example makes the propagation clearer.

Suppose the obligation entity acquires a new field: a severity classification that rates the financial risk if the obligation is missed. Adding this field to the ontology and running the code generation step produces four things. An updated validation schema that requires the field on every extracted obligation going forward. An updated database constraint that enforces the field at write time. An updated SQL table definition with the new column. An updated prompt fragment that instructs the LLM to identify and record the severity classification when extracting obligations.

Those four things update together, automatically. There is no separate prompt update. No separate migration script. No separate schema file to find and edit.
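The shape of that generation step can be sketched in a few functions that all read the same entity definition. This is an illustration under assumptions, not the real generator: FieldDef, EntityDef, the four function names, and the output formats are hypothetical, and the database constraint is reduced to the list of properties it would enforce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    name: str
    sql_type: str
    description: str
    required: bool = True


@dataclass(frozen=True)
class EntityDef:
    name: str
    fields: tuple[FieldDef, ...]


def validation_schema(entity: EntityDef) -> dict:
    """JSON-Schema-style dict used to validate each extracted record."""
    return {
        "type": "object",
        "properties": {f.name: {"description": f.description} for f in entity.fields},
        "required": [f.name for f in entity.fields if f.required],
    }


def required_properties(entity: EntityDef) -> list[str]:
    """Property names a write-time database constraint would enforce."""
    return [f.name for f in entity.fields if f.required]


def sql_table(entity: EntityDef) -> str:
    cols = ",\n".join(
        f"    {f.name} {f.sql_type}{' NOT NULL' if f.required else ''}"
        for f in entity.fields
    )
    return f"CREATE TABLE {entity.name.lower()} (\n{cols}\n);"


def prompt_fragment(entity: EntityDef) -> str:
    lines = [f"- {f.name}: {f.description}" for f in entity.fields]
    return f"For each {entity.name}, record:\n" + "\n".join(lines)


# Adding the severity field here is the only edit; all four outputs pick it up.
OBLIGATION = EntityDef("Obligation", (
    FieldDef("description", "TEXT", "what must be done"),
    FieldDef("severity", "TEXT", "financial risk if the obligation is missed"),
))
```

Because every output is derived from OBLIGATION, there is no way for the prompt to mention a field the table lacks, or the reverse.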

The other side of this is regression detection. The schema version that generated each artefact is tracked. When the ontology changes, the system can identify which existing extractions were produced under an older schema version and may be missing new fields. The staleness is visible rather than hidden inside data that looks complete.
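A minimal sketch of that check, assuming each stored extraction carries the schema version it was produced under as an integer. The Extraction class and the function stale_extractions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    fact_id: str
    schema_version: int  # ontology version in force when the fact was extracted


def stale_extractions(extractions: list[Extraction], current_version: int) -> list[Extraction]:
    """Return extractions produced under an older schema than the current one."""
    return [e for e in extractions if e.schema_version < current_version]
```

The result is a candidate list for re-extraction or review, not proof that a record is wrong: an older record may simply predate a field that does not apply to it.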

Conflict Resolution as a First-Class Design Decision

Most systems that extract facts from multiple documents handle conflicts implicitly. Last write wins, or the newest document wins, or conflicts are silently ignored. None of these is a defensible choice for a system where the facts carry legal weight.

The conflict resolution rules in the ontology make this explicit. When the extraction pipeline encounters a fact that contradicts something already in the graph, it checks the rules. If the new document has higher precedence, the old fact is marked superseded with a timestamp, a link is created to the new fact, and both are retained. If precedence is ambiguous, the conflict surfaces in the human review queue. Nothing is silently overwritten. Nothing is silently ignored.
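The flow can be sketched as follows. The PRECEDENCE table, the Fact fields, and the function resolve are hypothetical, and a single numeric rank is a simplification of whatever the real rules express. The handling of an incoming fact with lower precedence is also an assumption, since the description above only covers the higher and ambiguous cases.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical precedence table read from the ontology; higher rank wins.
PRECEDENCE = {"amendment": 2, "base_contract": 1}


@dataclass
class Fact:
    fact_id: str
    value: str
    source_kind: str
    status: str = "active"
    superseded_at: Optional[datetime] = None
    superseded_by: Optional[str] = None


def resolve(existing: Fact, incoming: Fact, review_queue: list,
            now: Optional[datetime] = None) -> None:
    """Apply the precedence rules; both facts are always retained."""
    now = now or datetime.now(timezone.utc)
    old_rank = PRECEDENCE.get(existing.source_kind)
    new_rank = PRECEDENCE.get(incoming.source_kind)

    # Ambiguous precedence: leave both facts untouched and ask a human.
    if old_rank is None or new_rank is None or old_rank == new_rank:
        review_queue.append((existing, incoming))
        return

    # Assumption: a lower-precedence incoming fact is stored as superseded
    # by the existing one rather than discarded.
    winner, loser = (incoming, existing) if new_rank > old_rank else (existing, incoming)
    loser.status = "superseded"
    loser.superseded_at = now
    loser.superseded_by = winner.fact_id
```

Note that no branch deletes or overwrites a value. The only mutations are the status, the timestamp, and the link to the superseding fact.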

This means the graph always reflects current authoritative state by default, but the history is preserved. A query that asks for current facts filters to what is active now. A query that asks what was believed at a specific date in the past is also answerable, because the supersession chain is intact.
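Both kinds of query reduce to a filter over the supersession chain. The sketch below assumes each version of a fact records when it entered the graph and when, if ever, it was superseded. FactVersion, current, and as_of are hypothetical names, and a real graph query would express the same filter in the database's own query language.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    value: str
    recorded_at: datetime
    superseded_at: Optional[datetime] = None  # None means still active


def current(chain: list[FactVersion]) -> Optional[FactVersion]:
    """The version that is active now, if any."""
    return next((v for v in chain if v.superseded_at is None), None)


def as_of(chain: list[FactVersion], when: datetime) -> Optional[FactVersion]:
    """The version that was believed at the given moment, if any."""
    for v in chain:
        not_yet_superseded = v.superseded_at is None or when < v.superseded_at
        if v.recorded_at <= when and not_yet_superseded:
            return v
    return None
```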

Because the conflict resolution rules are declared in the ontology rather than hardcoded inside the merge logic, they can be read, reviewed, and changed without touching application code. They are part of the domain model. The application code reads and applies them.

The Operational Consequence

The reason this design matters in practice is what it costs to extend the system.

Adding a new document type means adding an entry to the ontology, running the code generation step, and writing a classifier training example. The ingestion pipeline, the extraction jobs, the database constraints, and the API documentation all update from that single addition. There is no checklist of files to update. There is no risk of one layer knowing about the new document type and another layer not.

The same applies to adding an extraction field, adding a job, or changing a relationship type. The ontology is edited. The pipeline is regenerated. Everything stays in sync because everything comes from the same source.

The alternative, encoding these definitions in multiple places and keeping them in sync manually, scales badly. It works for small systems early in development. It becomes a maintenance burden as the system grows, and the maintenance burden grows faster than the system does.

Where I Am Taking This

The series so far has covered each pipeline stage individually. The next piece steps back and looks at the system as a whole: what the full architecture adds up to, what each design choice was responding to, and what the tradeoffs look like with the benefit of the complete picture.