Tesseract post-processing engine

Clean Unicode for handwritten Gurmukhi & Indic scripts

OCR mangles connected scripts: sihari lands before its consonant, nuktas drift, diacritics scatter. gurmukhifix repairs Tesseract output into well-formed Unicode for Gurmukhi, Punjabi, Hindi, Devanagari, Urdu and Farsi — and never corrupts text that was already correct.

+37.8% avg CER improvement
6 scripts supported
0 corruption of clean text
Raw OCR: ਿਸੱਖ ਧਰਮ
↓ gurmukhifix
Unicode: ਸਿੱਖ ਧਰਮ

The sihari (ਿ) is moved after its base consonant, fixing the #1 systematic Gurmukhi OCR error.

Interactive playground

Paste raw OCR text and watch it become clean, well-formed Unicode. Everything runs in your browser.


How it works

gurmukhifix is a post-processor, not an OCR engine. Tesseract turns the image into characters; gurmukhifix applies the linguistic rules Tesseract can't.

1. Image → Tesseract

Run Tesseract with JSON output: characters, confidences and bounding boxes.

2. Confidence routing

Text recognised with ≥85% confidence passes through unchanged, text below 60% is flagged, and the 60–85% middle band is corrected.

3. Evidence-gated repair

A fix is applied only if it lowers the script-validity badness score, so correct text is never changed.

4. Clean Unicode

Corrected text, a per-fix report and preserved layout metadata.
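
The confidence routing in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not gurmukhifix's actual code: the function name `route`, the decision labels and the sample tokens are assumptions, and only the 85% and 60% thresholds come from the description above.

```python
from typing import Literal

Route = Literal["pass", "correct", "flag"]

PASS_THRESHOLD = 85.0  # from the page: >=85% passes through
FLAG_THRESHOLD = 60.0  # from the page: <60% is flagged


def route(confidence: float) -> Route:
    """Map an OCR confidence (0-100) to a routing decision."""
    if confidence >= PASS_THRESHOLD:
        return "pass"
    if confidence < FLAG_THRESHOLD:
        return "flag"
    return "correct"  # the 60-85% middle band


# Hypothetical (text, confidence) pairs; real input would come from the OCR output.
tokens = [("ਧਰਮ", 93.0), ("ਿਸੱਖ", 71.5), ("??", 42.0)]
decisions = [(text, route(conf)) for text, conf in tokens]
```

Keeping the thresholds as named constants makes the two boundaries (inclusive at 85, exclusive at 60) explicit and easy to test.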

Sihari reordering

The dependent vowel ਿ is drawn to the left of its consonant but must be encoded after it in Unicode. OCR tends to emit it in visual order, so gurmukhifix moves it back behind the consonant.
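
As a rough sketch of the idea (not the library's implementation), the reordering can be expressed as a single regular expression. The function name `reorder_sihari` and the approximate consonant ranges are assumptions made for illustration; a sihari is only moved when nothing it could attach to precedes it.

```python
import re

SIHARI = "\u0A3F"  # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"   # GURMUKHI SIGN NUKTA
# Assumption: an approximate consonant class (U+0A15-U+0A39, U+0A59-U+0A5E).
CONSONANT = "[\u0A15-\u0A39\u0A59-\u0A5E]"

# A sihari that is NOT preceded by a consonant (or consonant + nukta)
# but IS followed by one has been emitted in visual order.
_ORPHAN_SIHARI = re.compile(
    f"(?<!{CONSONANT})(?<!{NUKTA}){SIHARI}({CONSONANT}{NUKTA}?)"
)


def reorder_sihari(text: str) -> str:
    """Move a visually ordered sihari behind the consonant that follows it."""
    return _ORPHAN_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)


raw = "\u0A3F\u0A38\u0A71\u0A16 \u0A27\u0A30\u0A2E"    # sihari typed first
fixed = reorder_sihari(raw)
```

Because the pattern requires the sihari to be orphaned, a word that already has it in logical order is left untouched. Vowel carriers and other edge cases are deliberately out of scope here.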

Nukta canonicalisation

A nukta after a vowel sign (ਸਾ਼) is reordered to the canonical consonant+nukta+vowel (ਸ਼ਾ).
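
The nukta rule can be sketched the same way. This is an illustrative assumption about how such a rule might look, with a hypothetical function name `canonicalise_nukta` and approximate character classes, not the project's own code.

```python
import re

NUKTA = "\u0A3C"
# Assumption: approximate Gurmukhi consonant and dependent-vowel classes.
CONSONANT = "[\u0A15-\u0A39\u0A59-\u0A5E]"
VOWEL_SIGN = "[\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C]"

# consonant + vowel sign(s) + nukta  ->  consonant + nukta + vowel sign(s)
_MISPLACED_NUKTA = re.compile(f"({CONSONANT})({VOWEL_SIGN}+){NUKTA}")


def canonicalise_nukta(text: str) -> str:
    """Place the nukta directly after its consonant, before any vowel sign."""
    return _MISPLACED_NUKTA.sub(lambda m: m.group(1) + NUKTA + m.group(2), text)


drifted = "\u0A38\u0A3E\u0A3C"            # sa + aa-sign + nukta
canonical = canonicalise_nukta(drifted)   # sa + nukta + aa-sign
```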

Never corrupts good text

Corrections require validity evidence. Already-correct Unicode round-trips byte-for-byte — enforced by CI.
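
One way to picture evidence gating: score the text before and after a candidate fix and keep the fix only if the score drops. The sketch below is an assumption-laden toy. The `badness` function merely counts orphaned vowel signs, and `naive_swap` is a deliberately careless fix used to show the gate rejecting damage to clean text. Neither is part of gurmukhifix.

```python
import re
from typing import Callable

NUKTA = "\u0A3C"
CONSONANTS = {chr(c) for c in range(0x0A15, 0x0A3A)} | {
    chr(c) for c in range(0x0A59, 0x0A5F)
}
VOWEL_SIGNS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def badness(text: str) -> int:
    """Toy validity score: number of vowel signs with no consonant before them."""
    score = 0
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        if ch in VOWEL_SIGNS and prev not in CONSONANTS and prev != NUKTA:
            score += 1
    return score


def gated_apply(text: str, fix: Callable[[str], str],
                score: Callable[[str], int] = badness) -> str:
    """Return the fixed text only if it strictly lowers the badness score."""
    candidate = fix(text)
    return candidate if score(candidate) < score(text) else text


def naive_swap(text: str) -> str:
    """Careless fix: swap every sihari with whatever character follows it."""
    return re.sub("\u0A3F(.)", lambda m: m.group(1) + "\u0A3F", text)


raw = "\u0A3F\u0A38\u0A71\u0A16"    # sihari before its consonant
clean = "\u0A38\u0A3F\u0A71\u0A16"  # already well-formed
```

Here `gated_apply(raw, naive_swap)` accepts the swap because it removes an orphaned sign, while `gated_apply(clean, naive_swap)` returns the input unchanged because the swap would create one.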

Validity report

Orphaned matras, impossible sequences and out-of-script code-points are surfaced with severity.
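
A validity report of this kind could be assembled roughly as follows. The `Issue` record, the `validate` function and the severity labels ("error", "warning") are hypothetical names chosen for this sketch, and only two of the listed checks are covered.

```python
from dataclasses import dataclass
from typing import List

NUKTA = "\u0A3C"
CONSONANTS = {chr(c) for c in range(0x0A15, 0x0A3A)} | {
    chr(c) for c in range(0x0A59, 0x0A5F)
}
VOWEL_SIGNS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


@dataclass(frozen=True)
class Issue:
    index: int     # position of the offending code point
    kind: str      # "orphaned_matra" or "out_of_script"
    severity: str  # assumed labels: "error" or "warning"


def validate(text: str) -> List[Issue]:
    """Report orphaned vowel signs and letters outside the Gurmukhi block."""
    issues: List[Issue] = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        if ch in VOWEL_SIGNS and prev not in CONSONANTS and prev != NUKTA:
            issues.append(Issue(i, "orphaned_matra", "error"))
        elif ch.isalpha() and not ("\u0A00" <= ch <= "\u0A7F"):
            issues.append(Issue(i, "out_of_script", "warning"))
    return issues


report = validate("\u0A3F\u0A38a")  # orphaned sihari, sa, Latin "a"
```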

Batch + learning

Parallel batch processing and a SQLite store that promotes repeatedly confirmed corrections.
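
The learning store might look something like the sketch below, built on Python's standard `sqlite3` module. The table layout, the function names and the promotion threshold of three confirmations are all assumptions for illustration. The page does not describe the real schema.

```python
import sqlite3
from typing import List, Tuple

PROMOTE_AFTER = 3  # hypothetical: confirmations needed before promotion


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the correction store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS corrections (
               raw TEXT NOT NULL,
               fixed TEXT NOT NULL,
               confirmations INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (raw, fixed)
           )"""
    )
    return conn


def confirm(conn: sqlite3.Connection, raw: str, fixed: str) -> None:
    """Record one confirmation of a raw -> fixed correction (upsert, SQLite 3.24+)."""
    conn.execute(
        "INSERT INTO corrections (raw, fixed, confirmations) VALUES (?, ?, 1) "
        "ON CONFLICT(raw, fixed) DO UPDATE SET confirmations = confirmations + 1",
        (raw, fixed),
    )
    conn.commit()


def promoted(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return corrections confirmed at least PROMOTE_AFTER times."""
    rows = conn.execute(
        "SELECT raw, fixed FROM corrections WHERE confirmations >= ? ORDER BY raw",
        (PROMOTE_AFTER,),
    )
    return rows.fetchall()


store = open_store()
for _ in range(3):
    confirm(store, "\u0A3F\u0A38", "\u0A38\u0A3F")
confirm(store, "\u0A38\u0A3E\u0A3C", "\u0A38\u0A3C\u0A3E")  # confirmed only once
```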

Layout preserved

Bounding boxes flow through end-to-end so downstream tools can rebuild the page.

Six scripts, one pipeline

A shared engine applies per-script rules, and one script can inherit another's rules via extends.
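
Inheritance via extends can be pictured as a simple recursive lookup. The rule names and the dictionary layout below are hypothetical; the sketch only shows how a child script such as Punjabi could pick up every rule of its parent before adding its own.

```python
from typing import Dict, List, Tuple

# Hypothetical rule-set definitions; names are illustrative only.
RULESETS: Dict[str, dict] = {
    "gurmukhi": {"rules": ["sihari_reorder", "nukta_canonical"]},
    "punjabi": {"extends": "gurmukhi", "rules": ["punjabi_extra"]},
    "devanagari": {"rules": ["matra_order"]},
    "hindi": {"extends": "devanagari", "rules": []},
}


def resolve_rules(lang: str, rulesets: Dict[str, dict],
                  _seen: Tuple[str, ...] = ()) -> List[str]:
    """Return parent rules first, then the script's own, without duplicates."""
    if lang in _seen:
        raise ValueError(f"circular extends chain at {lang!r}")
    spec = rulesets[lang]
    parent = spec.get("extends")
    inherited = resolve_rules(parent, rulesets, _seen + (lang,)) if parent else []
    own = [rule for rule in spec.get("rules", []) if rule not in inherited]
    return inherited + own


punjabi_rules = resolve_rules("punjabi", RULESETS)
```

Resolving parents first keeps the base script's rules ahead of any script-specific additions, and the `_seen` guard stops a misconfigured extends loop from recursing forever.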

Gurmukhi

The script of Sikh scripture and one of the writing systems for Punjabi.

ਪੰ

Punjabi

Punjabi written in the Gurmukhi script — it builds on every Gurmukhi rule.

हि

Hindi

Hindi written in the Devanagari script.

दे

Devanagari

The shared base script behind Hindi, Marathi, Nepali and Sanskrit.

اُ

Urdu

Urdu in the Nasta'liq style — a connected, right-to-left script.

فا

Farsi

Persian (Farsi), written in the Arabic script with Persian-specific letters.


Get started

Pure Python, MIT-licensed and free for anyone to use. Tesseract is a peer dependency that you install and run separately; gurmukhifix only needs its output, not Tesseract itself, at run time.

Install
pip install gurmukhifix

Run Tesseract → gurmukhifix
tesseract page.tif out --oem 1 --psm 6 json
gurmukhifix correct --input out.json \
  --lang gurmukhi --output ./results

Batch a folder
gurmukhifix batch --input-dir ./pages \
  --lang devanagari --workers 4

Get in touch

Questions, bug reports, research collaborations or integration help — send a message and I'll get back to you.