
Gurmukhi ਗੁਰਮੁਖੀ

The script of Sikh scripture and one of the writing systems for Punjabi.

Where it's used

Gurmukhi is used for the Guru Granth Sahib and centuries of Sikh and Punjabi manuscripts across north-west India. Most historical material is handwritten, which is where Tesseract struggles most.

What Tesseract gets wrong

Sihari written before its consonant

The vowel sign sihari (ਿ) is drawn to the left of its consonant, but Unicode requires it to be stored after the consonant. Tesseract emits characters in the order it sees them, so the sihari comes out before its consonant instead of after it.

ਿਸੱਖ → ਸਿੱਖ
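
The reordering can be sketched in a few lines of Python. This is an illustrative assumption about how such a rule might look, not gurmukhifix's actual implementation; the name `reorder_sihari` and the word-start heuristic are hypothetical. The sketch only moves a sihari that begins a word, because that is the one position where the visual-order reading is unambiguous.

```python
import re

SIHARI = "\u0A3F"  # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"   # GURMUKHI SIGN NUKTA
# Consonants: U+0A15..U+0A39 plus the precomposed nukta letters U+0A59..U+0A5E.
CONSONANT = "[\u0A15-\u0A39\u0A59-\u0A5E]"

# A sihari with no Gurmukhi character before it (so it starts a word),
# followed by a consonant and an optional nukta.
_WORD_INITIAL_SIHARI = re.compile(
    f"(?<![\u0A00-\u0A7F]){SIHARI}({CONSONANT}{NUKTA}?)"
)

def reorder_sihari(text: str) -> str:
    """Move a word-initial sihari to just after the consonant it belongs to."""
    return _WORD_INITIAL_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)

raw = "\u0A3F\u0A38\u0A71\u0A16"  # sihari, sa, addak, kha (visual order)
fixed = reorder_sihari(raw)       # sa, sihari, addak, kha (logical order)
```

A sihari in the middle of a word is left alone here, since it may already be attached to the consonant before it.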

Nukta drifts after the vowel

A nukta (਼) belongs immediately after its consonant (consonant → nukta → vowel). OCR often emits the vowel first, which Unicode normalisation will not fix.

ਖਾ਼ਸ → ਖ਼ਾਸ
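
A minimal Python sketch of this rule, again as an assumption rather than gurmukhifix's own code (the name `reorder_nukta` is hypothetical). It also checks the claim about normalisation using the standard `unicodedata` module: Gurmukhi vowel signs have canonical combining class 0, so normalisation never moves a nukta across them.

```python
import re
import unicodedata

NUKTA = "\u0A3C"  # GURMUKHI SIGN NUKTA
# Dependent vowel signs: aa, i, ii, u, uu, ee, ai, oo, au.
VOWEL_SIGN = "[\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C]"

_VOWEL_THEN_NUKTA = re.compile(f"({VOWEL_SIGN})({NUKTA})")

def reorder_nukta(text: str) -> str:
    """Swap 'vowel sign, nukta' into the required 'nukta, vowel sign' order."""
    return _VOWEL_THEN_NUKTA.sub(r"\2\1", text)

raw = "\u0A16\u0A3E\u0A3C\u0A38"  # kha, aa, nukta, sa (nukta has drifted)

# NFC does not repair the order: the vowel sign has combining class 0,
# which blocks canonical reordering of the nukta (class 7) across it.
after_nfc = unicodedata.normalize("NFC", raw)

fixed = reorder_nukta(raw)        # kha, nukta, aa, sa
```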

Aspirated pairs look alike

Handwritten ਕ/ਖ, ਗ/ਘ, ਪ/ਫ differ by a single stroke and are frequently swapped.

How gurmukhifix fixes it

Examples

Raw OCR | gurmukhifix | What happened
ਿਸੱਖ ਧਰਮ | ਸਿੱਖ ਧਰਮ | Sihari reordered after its consonant
ਖਾ਼ਸ ਗੱਲ | ਖ਼ਾਸ ਗੱਲ | Nukta moved before the vowel sign
ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ | ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ | Already correct, left untouched

Why this beats the alternatives

vs. Tesseract alone

Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.
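
The two rules named above can be written down as explicit checks. The following Python sketch is an assumption about what such checks could look like; `RULES` and `find_violations` are hypothetical names, not part of gurmukhifix's documented API. It reports where the text breaks a rule and does not attempt any repair.

```python
import re

NUKTA = "\u0A3C"
VOWEL_SIGN = "[\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C]"  # dependent vowel signs

# Each rule is a pattern that matches a malformed sequence.
RULES = {
    # A dependent vowel with no Gurmukhi character before it begins a word.
    "dependent vowel at word start": re.compile(
        f"(?<![\u0A00-\u0A7F]){VOWEL_SIGN}"
    ),
    # A nukta must follow its consonant directly, not a vowel sign.
    "nukta after vowel sign": re.compile(f"{VOWEL_SIGN}{NUKTA}"),
}

def find_violations(text: str) -> list[tuple[int, str]]:
    """Return (index, rule name) for every malformed sequence, sorted by index."""
    found = []
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), name))
    return sorted(found)

# Visual-order sihari: flagged at index 0.
report = find_violations("\u0A3F\u0A38\u0A71\u0A16")
```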

vs. find-and-replace / spellcheck

A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
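
One way to picture evidence gating is as a wrapper that keeps a candidate fix only if it lowers a count of well-formedness violations. The Python sketch below is a guess at the idea, not gurmukhifix's real scoring; `violations` and `gated` are hypothetical names, and the violation pattern covers only two rules for brevity.

```python
import re
from typing import Callable

# Malformed sequences: a vowel sign starting a word, or a nukta after a vowel sign.
_MALFORMED = re.compile(
    "(?<![\u0A00-\u0A7F])[\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C]"
    "|[\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C]\u0A3C"
)

def violations(text: str) -> int:
    """Count malformed sequences; lower means more valid."""
    return len(_MALFORMED.findall(text))

def gated(fix: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a fix so its result is kept only when it strictly reduces violations."""
    def apply(text: str) -> str:
        candidate = fix(text)
        return candidate if violations(candidate) < violations(text) else text
    return apply

# A blind substitution (ka -> kha) would corrupt correct text; the gate rejects it.
blind = lambda s: s.replace("\u0A15", "\u0A16")
good_word = "\u0A05\u0A15\u0A3E\u0A32"           # already well-formed
unchanged = gated(blind)(good_word)

# A real repair (nukta before vowel sign) lowers the count, so the gate accepts it.
nukta_fix = lambda s: re.sub("([\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C])(\u0A3C)", r"\2\1", s)
repaired = gated(nukta_fix)("\u0A16\u0A3E\u0A3C\u0A38")
```

Requiring a strict improvement is what guarantees that text with no violations passes through untouched.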

vs. doing nothing

Raw OCR often looks right on screen but is malformed Unicode, with code points in the wrong order or marks dropped. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
