Sihari written before its consonant
The vowel sign sihari (ਿ) is drawn to the left of its consonant, but Unicode requires it to be stored after that consonant. Tesseract outputs characters in the order it sees them, so the stored order is wrong.
ਿਸੱਖ → ਸਿੱਖ
The script of Sikh scripture and one of the writing systems for Punjabi.
Gurmukhi is used for the Guru Granth Sahib and for centuries of Sikh and Punjabi manuscripts across north-west India. Most historical material is handwritten, which is where Tesseract struggles most.
A nukta (਼) belongs immediately after its consonant (consonant → nukta → vowel sign). OCR often emits the vowel sign first, and Unicode normalisation will not fix that order.
ਖਾ਼ਸ → ਖ਼ਾਸ
Handwritten ਕ/ਖ, ਗ/ਘ and ਪ/ਫ differ by a single stroke and are frequently swapped.
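The two reordering problems described above can be sketched in a few lines of Python. This is an illustrative sketch only, not gurmukhifix's actual code: the function name `reorder_marks` and both regular expressions are assumptions made for this example, and it ignores conjuncts and other cases a real tool would need to handle.

```python
import re

SIHARI = "\u0A3F"   # ਿ  GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"    # ਼  GURMUKHI SIGN NUKTA
CONSONANTS = "\u0A15-\u0A39"                           # ਕ to ਹ, as a regex range
VOWEL_SIGNS = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"  # dependent vowel signs

# A sihari that does not follow a consonant (or consonant + nukta) was
# emitted in visual order; the consonant right after it is its real base.
_VISUAL_SIHARI = re.compile(
    f"(?<![{CONSONANTS}{NUKTA}]){SIHARI}([{CONSONANTS}]{NUKTA}?)"
)
# A nukta stored after a vowel sign belongs before that vowel sign.
_LATE_NUKTA = re.compile(f"([{VOWEL_SIGNS}]){NUKTA}")


def reorder_marks(text: str) -> str:
    """Hypothetical helper: move sihari and nukta into logical order."""
    text = _VISUAL_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)
    text = _LATE_NUKTA.sub(lambda m: NUKTA + m.group(1), text)
    return text


if __name__ == "__main__":
    for raw in ("\u0A3F\u0A38\u0A71\u0A16", "\u0A16\u0A3E\u0A3C\u0A38"):
        fixed = reorder_marks(raw)
        print([f"U+{ord(c):04X}" for c in raw], "->",
              [f"U+{ord(c):04X}" for c in fixed])
```

The strings are written as `\u` escapes so that the code-point order is visible in the source and cannot be silently altered by an editor. Single-stroke confusions such as ਕ/ਖ are a different kind of problem and are not addressed by this sketch.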
| Raw OCR | gurmukhifix | What happened |
|---|---|---|
| ਿਸੱਖ ਧਰਮ | ਸਿੱਖ ਧਰਮ | Sihari reordered after its consonant |
| ਖਾ਼ਸ ਗੱਲ | ਖ਼ਾਸ ਗੱਲ | Nukta moved before the vowel sign |
| ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ | ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ | Already correct, left untouched |
Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.
A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
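One way to picture evidence gating is to count structural rule violations before and after a candidate fix and keep the fix only if the count drops. The sketch below assumes that interpretation; the two rules, the deliberately crude `swap_first_two` rewrite, and all function names are hypothetical and are not taken from gurmukhifix.

```python
import re

NUKTA = "\u0A3C"                                       # ਼
VOWEL_SIGNS = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"  # dependent vowel signs

# Assumed validity rules for this sketch (a real tool would have more).
VIOLATIONS = [
    re.compile(f"(?<!\\S)[{VOWEL_SIGNS}]"),   # vowel sign at the start of a word
    re.compile(f"[{VOWEL_SIGNS}]{NUKTA}"),    # nukta stored after a vowel sign
]


def count_violations(text: str) -> int:
    return sum(len(rule.findall(text)) for rule in VIOLATIONS)


def swap_first_two(word: str) -> str:
    """A blind rewrite: always swap the first two code points."""
    return word[1] + word[0] + word[2:] if len(word) > 1 else word


def apply_if_better(word: str, fix) -> str:
    """Keep the candidate only when it has strictly fewer violations."""
    candidate = fix(word)
    return candidate if count_violations(candidate) < count_violations(word) else word


def gated_fix(text: str) -> str:
    return " ".join(apply_if_better(w, swap_first_two) for w in text.split(" "))


if __name__ == "__main__":
    raw = "\u0A3F\u0A38\u0A71\u0A16 \u0A27\u0A30\u0A2E"   # misordered sihari, then a clean word
    print(gated_fix(raw) == "\u0A38\u0A3F\u0A71\u0A16 \u0A27\u0A30\u0A2E")
```

Applied blindly, `swap_first_two` would scramble every word. Behind the gate it changes only the word whose violation count it actually reduces, which is the property the paragraph above describes.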
Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
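To see why text that looks right can still break search, it helps to print the code points. The following sketch uses only Python's standard `unicodedata` module; the sample strings are the ਖਾ਼ਸ / ਖ਼ਾਸ pair from this page, and the helper name `describe` is made up for the example.

```python
import unicodedata

raw = "\u0A16\u0A3E\u0A3C\u0A38"     # OCR order: consonant, vowel sign, nukta
fixed = "\u0A16\u0A3C\u0A3E\u0A38"   # expected order: consonant, nukta, vowel sign


def describe(text: str) -> list[str]:
    """List each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    print("raw:  ", describe(raw))
    print("fixed:", describe(fixed))

    # NFC leaves the raw order alone because the vowel sign is a starter
    # (combining class 0), so the nukta is never reordered across it.
    print("NFC repairs raw:", unicodedata.normalize("NFC", raw) == fixed)

    # A search for the well-formed word misses the malformed one.
    document = raw + " \u0A17\u0A71\u0A32"
    print("search hit:", fixed in document)
```

Both comparisons come out false, which is the silent failure described above: the page may render plausibly, but a query typed in the correct order will not match it.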