
Field notes

Why we built gurmukhifix: 600 years of manuscripts vs. one stubborn OCR problem

It started with a pile of scans. Thousands of them: handwritten manuscripts, ledgers and printed pages from north-west India spanning roughly the 1400s to the present day. Gurmukhi scripture and Punjabi correspondence. Persian and Urdu administrative records. Hindi and other Devanagari texts. The goal was simple to state and hard to do: turn the images into searchable, reusable digital text.

The promise and the wall

Modern OCR feels like magic on clean printed English. So we pointed Tesseract at the collection and waited. What came back looked, at a glance, like text. But when we tried to search it, index it, or paste it into a document, it fell apart.

For handwritten Gurmukhi and Urdu, character error rates routinely ran past 30–40%. Worse, the errors weren't random noise — they were systematic, and they produced Unicode that was subtly, invisibly broken.

The bug you can't see

Take one example that haunted the Gurmukhi pages. The vowel sign sihari (ਿ) is written to the left of the consonant it belongs to — but the Unicode standard says it must be stored after that consonant. Tesseract, faithfully, writes down what it sees: the sihari first. The result renders almost correctly on screen, so it passes the eye test. Then you search for the word and get nothing, because at the byte level it's a different, impossible sequence.
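To make the invisible difference concrete, here is a minimal Python sketch using only the standard library. The word ਕਿਤਾਬ and the `page` string are our own illustrative stand-ins, not samples from the corpus. The two strings contain the same five characters and differ only in where the sihari (U+0A3F) is stored.

```python
import unicodedata

SIHARI = "\u0A3F"  # GURMUKHI VOWEL SIGN I

# Logical order, as Unicode expects: consonant first, then the sihari.
logical = "\u0A15\u0A3F\u0A24\u0A3E\u0A2C"  # ਕਿਤਾਬ
# Visual order, as it sits on the page: sihari first.
visual = "\u0A3F\u0A15\u0A24\u0A3E\u0A2C"


def codepoints(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Hypothetical stand-in for one line of OCR output containing the word.
page = "x " + visual + " y"

print(codepoints(logical))
print(codepoints(visual))
print(logical in page)                                   # the search misses
print(unicodedata.normalize("NFC", visual) == logical)   # NFC does not repair it
```

Normalisation alone cannot help here: the sihari has canonical combining class 0, so Unicode's canonical reordering never moves it relative to the consonant.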

Persian and Urdu had their own version of this: a single miscounted or dropped nukta confusing ب with پ, or ی silently encoded as ي. Hindi had matras detached from their consonants. Every script had a handful of these: predictable, rule-shaped failures that no amount of re-running Tesseract would fix, because Tesseract has no idea what the script's rules are. Its job is pixels to characters. It does not know that a dependent vowel cannot begin a word.
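The ی / ي confusion can be illustrated with a small detector. This is a sketch under our own assumptions: `find_lookalikes` is a hypothetical helper, not part of gurmukhifix, and the sample word ایران is chosen only because the two letters look identical in the middle of a word. It reports suspect code points rather than rewriting them.

```python
import unicodedata

# Code point that often appears by mistake -> letter expected in Urdu/Persian text.
# Only the pair named in the text is listed; a real table would be longer.
LOOKALIKES = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def find_lookalikes(text):
    """Return (index, found_name, expected_name) for each suspect code point."""
    hits = []
    for i, ch in enumerate(text):
        if ch in LOOKALIKES:
            expected = LOOKALIKES[ch]
            hits.append((i, unicodedata.name(ch), unicodedata.name(expected)))
    return hits


good = "\u0627\u06CC\u0631\u0627\u0646"  # ایران with the Farsi yeh
bad = "\u0627\u064A\u0631\u0627\u0646"   # same word with the Arabic yeh

print(find_lookalikes(good))
print(find_lookalikes(bad))
```

Neither NFC nor NFKC maps one yeh to the other, so a script-aware rule is needed to catch this class of error.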

Why the obvious fixes didn't work

The tempting first move is a find-and-replace table: "whenever you see X, write Y." We tried versions of that. It was a disaster. A blind substitution rewrites the letters that were already correct, and on a corpus that is mostly correct, that means you corrupt far more than you fix. An early naive pass actually made the text worse than raw Tesseract — by a lot.

The other tempting move is "just train a better model." That helps the recognition step, but it's expensive, needs labelled handwriting data we didn't have, and still leaves the structural Unicode problems untouched.

The idea: correct only with evidence

What finally worked was a different framing. Don't guess. Only change a character when there's evidence that the change makes the text more linguistically valid. If a word is already well-formed, leave it completely alone. If a sihari is sitting where a vowel sign can't legally sit, reorder it — because that move provably resolves a violation. If two letters are genuinely ambiguous and there's no signal which is right, don't flip a coin; flag it for a human.
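As a rough illustration of that framing, the sketch below handles only the sihari case. It is not the gurmukhifix implementation: `fix_sihari`, `is_base` and `cluster_end` are hypothetical names, and the letter ranges are a simplification. A sihari is moved only when no letter that could carry it comes before it and a consonant cluster follows; a well-formed sihari is left alone, and one that cannot be placed is flagged instead of guessed.

```python
SIHARI = "\u0A3F"  # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"   # GURMUKHI SIGN NUKTA
VIRAMA = "\u0A4D"  # GURMUKHI SIGN VIRAMA (joins subjoined letters)


def is_base(ch):
    """True for letters that can carry a sihari: consonants and the iri bearer."""
    return "\u0A15" <= ch <= "\u0A39" or "\u0A59" <= ch <= "\u0A5E" or ch == "\u0A72"


def cluster_end(text, start):
    """Index just past the consonant cluster that begins at text[start]."""
    j = start + 1
    if j < len(text) and text[j] == NUKTA:
        j += 1
    while j + 1 < len(text) and text[j] == VIRAMA and is_base(text[j + 1]):
        j += 2
        if j < len(text) and text[j] == NUKTA:
            j += 1
    return j


def fix_sihari(text):
    """Return (corrected_text, notes); notes record every move or open question."""
    out, notes = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != SIHARI:
            out.append(ch)
            i += 1
            continue
        # Evidence check: does a letter that can carry the sihari precede it?
        prev = out[-1] if out else ""
        if prev == NUKTA and len(out) > 1:
            prev = out[-2]
        if prev and is_base(prev):
            out.append(ch)  # already well-formed: leave it alone
            i += 1
        elif i + 1 < len(text) and is_base(text[i + 1]):
            # Violation with a clear repair: move the sihari past the cluster.
            j = cluster_end(text, i + 1)
            out.extend(text[i + 1:j])
            out.append(SIHARI)
            notes.append(("reordered", i))
            i = j
        else:
            out.append(ch)  # no safe repair: keep the text, flag for a human
            notes.append(("needs_review", i))
            i += 1
    return "".join(out), notes


fixed, notes = fix_sihari("\u0A3F\u0A15\u0A24\u0A3E\u0A2C")
print(fixed, notes)
```

Note what the sketch deliberately does not do: a sihari that already follows a consonant is never touched, even though it might in principle belong to the next letter. There is no violation to resolve there, so there is no evidence for a change.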

That principle — evidence-gated correction — became gurmukhifix. It sits after Tesseract, reads its JSON output, and applies the script-specific rules Tesseract can't: reorder the sihari, canonicalise the nukta, repair the hamza carrier, normalise to clean NFC Unicode, and surface anything it isn't sure about. Crucially, it is measured against a hard rule in continuous integration: it must never make a page worse than raw Tesseract.
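The never-worse rule can be pictured as a simple comparison of character error rates. The sketch below is an assumption about what such a check might look like, not the project's actual CI harness: `cer` and `never_worse` are hypothetical names, and the error rate is computed as plain Levenshtein distance over NFC-normalised code points divided by the reference length.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of hypothesis against a non-empty reference."""
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(hyp, ref) / len(ref)


def never_worse(raw, corrected, reference):
    """True if the corrected page is at least as close to the truth as raw OCR."""
    return cer(corrected, reference) <= cer(raw, reference)


truth = "\u0A15\u0A3F\u0A24\u0A3E\u0A2C"  # ਕਿਤਾਬ, illustrative ground truth
raw = "\u0A3F\u0A15\u0A24\u0A3E\u0A2C"    # OCR output with the sihari first
corrected = truth                         # what a successful fix would produce

print(cer(raw, truth), cer(corrected, truth))
print(never_worse(raw, corrected, truth))
```

A test suite could run a check like this over every benchmark page and fail the build if any single page returns False.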

Why all these scripts, together

The archives of north-west India don't come neatly sorted by script. A single shelf might hold Gurmukhi scripture, a Persian land record and an Urdu letter. Existing post-processing tools, where they existed at all, covered one script in isolation. We needed one pipeline that understood Gurmukhi, Punjabi, Hindi, Devanagari, Urdu and Farsi — sharing an engine, differing only in their rules. That's what gurmukhifix is.

The result

On our benchmarks, the corrected output now has a substantially lower character error rate than raw Tesseract on the error cases, leaves clean text untouched, and never regresses. The text that comes out is canonical Unicode you can actually search, index and trust.

It is not a replacement for Tesseract, and it can't read handwriting that Tesseract fundamentally couldn't. It is the missing layer between "the OCR ran" and "the text is usable." For anyone trying to make six centuries of a region's writing searchable, that layer turned out to be the whole game.

gurmukhifix is free and open source under the MIT licence. Try the live demo or pip install gurmukhifix.