The Missing Bracket: How Tiny Formatting Errors Break Outputs

A single missing closing bracket made a legal passage genuinely ambiguous and the model's answers inconsistent, showing that thoroughly reading and correcting the input is essential for reliability.

When I was on the API team at OpenAI, I ran into a problem that changed how I think about “model accuracy,” especially in high-stakes domains like law.

One of our partners was exploring how well GPT-3 could handle legal documents—contracts and case-like text where meaning depends on precise language and structure. It’s an obviously compelling space: the legal world runs on documents and interpretation, and anything that speeds up review and analysis could have huge impact. Some court cases involve millions of documents, and that volume alone can drag timelines out for ages.

A side note that always stuck with me: one of the earliest widely available datasets used for natural language processing research was the Enron email corpus—made public through court discovery. It’s a funny thought that some early NLP systems learned their language patterns from Enron.

Back to the task. I was given a text example and a question about it—something along the lines of whether one party’s obligations affected another party. The model was getting the answer “right” around 60–70% of the time. That range is interesting because it suggests the model isn’t clueless; it’s seeing something real in the text. When a model is truly lost, you often get near-random results, like a coin flip. But 60–70% implies it’s finding signal, just not reliably.
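To make the "signal, not chance" intuition concrete, here is a quick back-of-the-envelope check (my own illustration, not part of the original experiment): if a model answers a yes/no question correctly 65 times out of 100 hypothetical trials, how plausible is that under pure coin-flip guessing?

```python
import math

def binomial_p_value(successes: int, trials: int, p: float = 0.5) -> float:
    """One-sided probability of seeing >= `successes` correct answers
    if each answer were an independent guess with success rate `p`."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 65/100 correct under coin-flip guessing: p-value is well under 0.01,
# so a 60-70% hit rate almost certainly reflects real signal in the text.
print(binomial_p_value(65, 100))
```

The point of the sketch: 65% over a decent number of trials is statistically far from random, which is exactly why it suggests the model is picking up something real but unreliable.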

So I did what I usually do: prompt iteration. I tried different formulations, adjusted wording, and pushed the model toward more structured analysis. I managed to raise the success rate to about 75%, with more consistent answers aligned with what we expected.

And that’s where the frustration kicked in. If I could get it to 75% with prompt wording, why not 90%? Why not 100%? I tried everything I could think of—having the model explain the document, rewrite it in its own words, break it into parts—anything that might help me assess whether it truly “understood” the text.

Nothing worked.

Eventually, after what felt like my nth attempt, it dawned on me: I hadn't actually looked at the document carefully myself. I'd been treating the model like a specimen in a lab experiment, tweaking the experimental setup, when I should have been examining the specimen.


So I finally read the text closely.

It was full of nested clauses with brackets: a bracket opening, a sub-clause in brackets, another bracket inside that, and so on—then closing brackets to wrap everything up. Except something was off.

There were three opening brackets and only two closing brackets.
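This is exactly the kind of defect a few lines of code can catch before a document ever reaches a model. Here is a minimal sanity check along those lines (an illustrative sketch, not the tooling we actually had) that scans a passage and reports unmatched brackets:

```python
def check_brackets(text: str) -> list[str]:
    """Return human-readable bracket problems; an empty list means balanced."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack: list[tuple[str, int]] = []
    problems: list[str] = []
    for i, ch in enumerate(text):
        if ch in "([{":
            stack.append((ch, i))          # remember where each bracket opened
        elif ch in pairs:
            if not stack or stack[-1][0] != pairs[ch]:
                problems.append(f"unmatched '{ch}' at position {i}")
            else:
                stack.pop()                # matched: close the innermost open bracket
    # anything still on the stack was opened but never closed
    problems.extend(f"unclosed '{ch}' opened at position {i}" for ch, i in stack)
    return problems

# Three opening brackets, only two closing ones, as in the legal passage:
print(check_brackets("[a [b [c]]"))  # -> ["unclosed '[' opened at position 0"]
```

Running a check like this over inputs costs almost nothing, and it would have surfaced the ambiguity immediately, long before any prompt iteration.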

That single missing bracket meant the passage was genuinely ambiguous. And in legal writing, ambiguity isn’t theoretical—it’s the kind of detail that can decide outcomes. Entire court cases have hinged on punctuation, formatting, or a single structural token that changed how a clause was interpreted.

Once I added the missing closing bracket back in, the model’s behavior snapped into place. It started giving the same correct answer consistently, over and over again. The problem wasn’t prompt wording. The problem was that the document itself had a structural error that created multiple plausible readings—and the model had been doing the best it could under ambiguity.

Looking back, it was a useful lesson about the GPT-3 era in general. These weren’t “reasoning” models in the way we think about some newer systems today. They didn’t naturally stop to question whether the source text was malformed. You could sometimes prompt that kind of checking for short inputs, but with hundreds or thousands of words, it was far less practical.

Today, larger models—and especially reasoning-oriented models—can do more self-checking and verification. But even now, the core lesson holds: when a model is inconsistently right, it might not be the model. It might be the input.

Sometimes, the most important “prompt engineering” step is simply reading the document.