Sihari reordering
The dependent vowel ਿ is written before its consonant but must be encoded after it. gurmukhifix moves it back.
OCR mangles connected scripts: sihari lands before its consonant, nuktas drift, diacritics scatter. gurmukhifix repairs Tesseract output into well-formed Unicode for Gurmukhi, Punjabi, Hindi, Devanagari, Urdu and Farsi — and never corrupts text that was already correct.
ਿਸੱਖ ਧਰਮ → ਸਿੱਖ ਧਰਮ
Sihari ਿ reordered after its base consonant, the #1 systematic Gurmukhi OCR error.
Paste raw OCR text and watch it become clean, well-formed Unicode. Everything runs in your browser.
Upload a photo or scan of a single line/word. In-browser OCR uses Tesseract.js, then gurmukhifix cleans the result. Handwriting accuracy is limited — this demos the pipeline, not production OCR.
gurmukhifix is a post-processor, not an OCR engine. Tesseract turns the image into characters; gurmukhifix applies the linguistic rules Tesseract can't.
Run Tesseract with JSON output: characters, confidence and bounding boxes.
Text recognized with ≥85% confidence passes through unchanged, text below 60% is flagged, and the middle band (60% up to 85%) is corrected.
A fix is applied only if it lowers script-validity badness — correct text is never changed.
Corrected text, a per-fix report and preserved layout metadata.
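The confidence bands and the "only if badness goes down" rule above can be sketched roughly as follows. This is a minimal illustration, not gurmukhifix's actual code: `triage`, `badness`, `swap_sihari` and `apply_if_better` are hypothetical names, and the badness score is a toy that only counts sihari signs with no consonant in front of them.

```python
from typing import Callable

PASS_THRESHOLD = 85.0  # from the text: >=85% passes through
FLAG_THRESHOLD = 60.0  # from the text: <60% is flagged

SIHARI = "\u0A3F"  # Gurmukhi vowel sign I
NUKTA = "\u0A3C"   # Gurmukhi sign nukta


def triage(confidence: float) -> str:
    """Map an OCR confidence (0-100) to one of the three bands."""
    if confidence >= PASS_THRESHOLD:
        return "pass"
    if confidence < FLAG_THRESHOLD:
        return "flag"
    return "correct"


def is_consonant(ch: str) -> bool:
    """Rough check against the main Gurmukhi consonant range (assumption)."""
    return len(ch) == 1 and "\u0A15" <= ch <= "\u0A39"


def badness(text: str) -> int:
    """Toy validity score: sihari signs not attached to a consonant."""
    score = 0
    for i, ch in enumerate(text):
        if ch == SIHARI:
            prev = text[i - 1] if i > 0 else ""
            if not (is_consonant(prev) or prev == NUKTA):
                score += 1
    return score


def swap_sihari(text: str) -> str:
    """Deliberately naive fix: swap every sihari with a following consonant."""
    out = list(text)
    i = 0
    while i < len(out) - 1:
        if out[i] == SIHARI and is_consonant(out[i + 1]):
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return "".join(out)


def apply_if_better(text: str, fix: Callable[[str], str]) -> str:
    """Keep the candidate only when it strictly lowers badness."""
    candidate = fix(text)
    return candidate if badness(candidate) < badness(text) else text
```

With these definitions, `apply_if_better("\u0A3F\u0A38\u0A71\u0A16", swap_sihari)` accepts the swap because badness drops from 1 to 0, while the already-valid `"\u0A15\u0A3F\u0A38"` is returned untouched even though the naive fix would have rearranged it.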
A nukta after a vowel sign (ਸਾ਼) is reordered to the canonical consonant+nukta+vowel (ਸ਼ਾ).
Corrections require validity evidence. Already-correct Unicode round-trips byte-for-byte — enforced by CI.
Orphaned matras, impossible sequences and out-of-script code-points are surfaced with severity.
Parallel batch processing and a SQLite store that promotes repeatedly-confirmed corrections.
Bounding boxes flow through end-to-end so downstream tools can rebuild the page.
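As a rough illustration of the sihari and nukta reordering rules, the sketch below uses two regular expressions over Gurmukhi code points. The function names `reorder_sihari` and `reorder_nukta` and the character ranges are assumptions made for this example; the real rule set is likely broader.

```python
import re

CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5C\u0A5E"    # Gurmukhi consonants (approximate ranges)
NUKTA = "\u0A3C"                                   # sign nukta
SIHARI = "\u0A3F"                                  # vowel sign I
VOWEL_SIGN = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"  # dependent vowel signs

# A sihari that is NOT already attached to a consonant (or consonant+nukta)
# and is followed by a consonant, optionally with its nukta.
_STRAY_SIHARI = re.compile(
    f"(?<![{CONSONANT}{NUKTA}]){SIHARI}([{CONSONANT}]{NUKTA}?)"
)

# consonant + vowel sign + nukta: the nukta has drifted behind the vowel sign.
_NUKTA_AFTER_VOWEL = re.compile(f"([{CONSONANT}])([{VOWEL_SIGN}])({NUKTA})")


def reorder_sihari(text: str) -> str:
    """Move a visually-ordered sihari to after its base consonant."""
    return _STRAY_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)


def reorder_nukta(text: str) -> str:
    """Rewrite consonant+vowel+nukta as consonant+nukta+vowel."""
    return _NUKTA_AFTER_VOWEL.sub(r"\1\3\2", text)


if __name__ == "__main__":
    broken = "\u0A3F\u0A38\u0A71\u0A16 \u0A27\u0A30\u0A2E"  # sihari typed before the consonant
    print(reorder_sihari(broken))
    print(reorder_nukta("\u0A38\u0A3E\u0A3C"))
```

Because the sihari pattern only fires when the sign has no consonant in front of it, text that is already in logical order passes through unchanged, which is the round-trip property described above.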
Shared engine, per-script rules with inheritance via extends. Click any script for a plain-English deep-dive.
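One way to picture per-script rules with `extends` inheritance is a small registry where each script names its parent and adds its own rules. The registry layout, the rule names and `resolve_rules` below are all hypothetical, chosen only to illustrate the idea that Punjabi builds on every Gurmukhi rule.

```python
from typing import Dict, List, Optional, TypedDict


class ScriptConfig(TypedDict):
    extends: Optional[str]  # parent script, or None for a base script
    rules: List[str]        # rule names added by this script (hypothetical)


SCRIPTS: Dict[str, ScriptConfig] = {
    "gurmukhi": {"extends": None, "rules": ["sihari_reorder", "nukta_reorder"]},
    "punjabi": {"extends": "gurmukhi", "rules": ["punjabi_spelling"]},
    "devanagari": {"extends": None, "rules": ["matra_reorder"]},
    "hindi": {"extends": "devanagari", "rules": ["hindi_spelling"]},
}


def resolve_rules(name: str, registry: Dict[str, ScriptConfig] = SCRIPTS) -> List[str]:
    """Collect rules from the base script down to `name`, parents first."""
    chain: List[ScriptConfig] = []
    seen: List[str] = []
    current: Optional[str] = name
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic 'extends' chain at {current!r}")
        seen.append(current)
        chain.append(registry[current])
        current = registry[current]["extends"]

    rules: List[str] = []
    for config in reversed(chain):  # base script first, most specific last
        for rule in config["rules"]:
            if rule not in rules:
                rules.append(rule)
    return rules
```

Under these assumptions, `resolve_rules("punjabi")` yields the Gurmukhi rules followed by the Punjabi-specific one, so a child script never has to restate what its parent already defines.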
The script of Sikh scripture and one of the writing systems for Punjabi.
ਪੰ Punjabi written in the Gurmukhi script; it builds on every Gurmukhi rule.
हि Hindi written in the Devanagari script.
दे The shared base script behind Hindi, Marathi, Nepali and Sanskrit.
اُ Urdu in the Nasta'liq style, a connected, right-to-left script.
فا Persian (Farsi), written in Arabic script with Persian-specific letters.
Pure-Python, MIT-licensed and free for anyone. Tesseract is a peer dependency, not a runtime requirement.
pip install gurmukhifix
tesseract page.tif out --oem 1 --psm 6 json
gurmukhifix correct --input out.json \
    --lang gurmukhi --output ./results
gurmukhifix batch --input-dir ./pages \
    --lang devanagari --workers 4
Questions, bug reports, research collaborations or integration help? Send a message and I'll get back to you.