Punjabi OCR correction

Where it's used

Punjabi is the everyday language of the Punjab. Its gurmukhifix config extends the Gurmukhi config, inheriting all of its rules and adding Punjabi-specific ones for loanwords and nasalisation.

What Tesseract gets wrong

Nukta letters for loanwords

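Perso-Arabic loanwords use the nukta letters ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼. The nukta is a small dot below the letter, so OCR easily drops it or emits it in the wrong position, for example after the vowel sign instead of directly after its consonant.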

ਭਾਸਾ਼→ਭਾਸ਼ਾ
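
The reordering shown above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not gurmukhifix's actual implementation: the function name `restore_nukta_order` and the regular expression are invented for the example, and it only handles a nukta that landed after a single vowel sign.

```python
import re

NUKTA = "\u0a3c"  # GURMUKHI SIGN NUKTA
# Dependent vowel signs: kanna, sihari, bihari, aunkar, dulainkar,
# lavan, dulavan, hora, kanaura.
VOWEL_SIGNS = "\u0a3e-\u0a42\u0a47\u0a48\u0a4b\u0a4c"
# Base consonants of the five loanword nukta letters: sa, kha, ga, ja, pha.
NUKTA_BASES = "\u0a38\u0a16\u0a17\u0a1c\u0a2b"

# base consonant + vowel sign + nukta: the nukta is one slot too late.
_MISPLACED = re.compile(f"([{NUKTA_BASES}])([{VOWEL_SIGNS}])({NUKTA})")


def restore_nukta_order(text: str) -> str:
    """Move a nukta that follows a vowel sign back next to its consonant."""
    return _MISPLACED.sub(r"\1\3\2", text)


# bha, kanna, sa, kanna, nukta  ->  bha, kanna, sa, nukta, kanna
raw = "\u0a2d\u0a3e\u0a38\u0a3e\u0a3c"
fixed = restore_nukta_order(raw)
```

Text whose nukta already sits directly after its consonant does not match the pattern and is returned unchanged.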

Tippi vs bindi nasalisation

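The two nasalisation marks, tippi (ੰ) and bindi (ਂ), look alike at OCR resolution, and which one is correct depends on the vowel that precedes it, so a recogniser working from shape alone can pick the wrong one.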
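
To make the context dependence concrete, here is a rough Python sketch that flags a nasalisation mark whose preceding character suggests the other mark. The convention encoded here (bindi after the long vowel signs kanna, bihari, lavan, dulavan, hora and kanaura; tippi after a bare consonant, sihari or aunkar) is a simplifying assumption for illustration, and the function names are hypothetical. Cases the sketch cannot decide, such as dulainkar or independent vowels, are left alone.

```python
from typing import List, Optional

TIPPI = "\u0a70"
BINDI = "\u0a02"

# Assumed convention, simplified for illustration.
BINDI_AFTER = set("\u0a3e\u0a40\u0a47\u0a48\u0a4b\u0a4c")  # long vowel signs
TIPPI_AFTER = set("\u0a3f\u0a41")                          # sihari, aunkar


def expected_nasal_mark(prev: str) -> Optional[str]:
    """Return the mark the assumed convention expects after `prev`, or None."""
    if prev in BINDI_AFTER:
        return BINDI
    if prev in TIPPI_AFTER or "\u0a15" <= prev <= "\u0a39":  # bare consonant
        return TIPPI
    return None  # undecided: do not flag


def suspicious_nasal_marks(word: str) -> List[int]:
    """Indexes of tippi/bindi marks that contradict the assumed convention."""
    hits = []
    for i in range(1, len(word)):
        if word[i] in (TIPPI, BINDI):
            want = expected_nasal_mark(word[i - 1])
            if want is not None and want != word[i]:
                hits.append(i)
    return hits
```

A real corrector would treat a flagged position as a candidate only, and confirm it against further evidence before changing the text.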

All Gurmukhi failure modes

Because Punjabi uses the Gurmukhi script, every sihari and aspirated-pair issue applies here too.

ਿਕਸਾਨ→ਕਿਸਾਨ
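
The sihari fix above amounts to swapping a word-initial sihari with the consonant that follows it. A minimal Python sketch, assuming that "word-initial" means "not preceded by a non-space character" and using a hypothetical function name `reorder_initial_sihari`:

```python
import re

SIHARI = "\u0a3f"            # GURMUKHI VOWEL SIGN I
NUKTA = "\u0a3c"
CONSONANTS = "\u0a15-\u0a39"  # ka .. ha

# A sihari at the start of a word, followed by a consonant (plus optional nukta).
_INITIAL_SIHARI = re.compile(f"(?<!\\S){SIHARI}([{CONSONANTS}]{NUKTA}?)")


def reorder_initial_sihari(text: str) -> str:
    """Encode a visually-left sihari after the consonant it belongs to."""
    return _INITIAL_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)


# sihari, ka, sa, kanna, na  ->  ka, sihari, sa, kanna, na
raw = "\u0a3f\u0a15\u0a38\u0a3e\u0a28 \u0a35\u0a30\u0a17"
fixed = reorder_initial_sihari(raw)
```

A sihari that already follows a consonant is never word-initial, so correctly encoded words pass through untouched.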

How gurmukhifix fixes it

Inherits the full Gurmukhi rule set via config extends.
Restores nukta order for loanword letters (ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼).
Handles tippi/bindi nasalisation as evidence-gated corrections.
Reorders sihari and normalises to clean NFC Unicode.
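
The final normalisation step can be illustrated with Python's standard `unicodedata` module. The sketch below shows why it matters: the single code point U+0A36 and the two-code-point sequence U+0A38 U+0A3C render as the same letter ਸ਼ but compare unequal until both are normalised. The helper name `to_nfc` is an assumption for the example; how gurmukhifix performs this step internally is not shown here.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalise text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


single = "\u0a36"          # GURMUKHI LETTER SHA as one code point
sequence = "\u0a38\u0a3c"  # GURMUKHI LETTER SA followed by SIGN NUKTA

same_before = single == sequence                 # raw strings differ
same_after = to_nfc(single) == to_nfc(sequence)  # normalised forms agree
```

Note that NFC alone does not repair a nukta or sihari emitted in the wrong slot, which is why the reordering rules run as separate steps.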

Examples

Raw OCR	gurmukhifix	What happened
`ਿਕਸਾਨ ਵਰਗ`	`ਕਿਸਾਨ ਵਰਗ`	Sihari reordered
`ਭਾਸਾ਼ ਬੋਲੀ`	`ਭਾਸ਼ਾ ਬੋਲੀ`	Nukta moved before the vowel
`ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ`	`ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ`	Already correct — untouched

Why this beats the alternatives

vs. Tesseract alone

Tesseract turns pixels into characters. It has no linguistic knowledge: it can't know that a dependent vowel sign may not begin a word, or that a sign written to the left of a letter must still be encoded after it. gurmukhifix adds exactly those rules.

vs. find-and-replace / spellcheck

A blind substitution table also rewrites letters that were already correct, corrupting good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
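
One way to picture evidence gating is as a wrapper that accepts a candidate fix only if it strictly reduces the number of ill-formed positions. The Python sketch below is an assumption-laden illustration: the validity measure (`count_violations`, which counts vowel signs and nuktas that lack a valid preceding character) and the names `gated_apply` and `blind_swap` are hypothetical and far simpler than a real rule set.

```python
from typing import Callable

NUKTA = "\u0a3c"
VOWEL_SIGNS = set("\u0a3e\u0a3f\u0a40\u0a41\u0a42\u0a47\u0a48\u0a4b\u0a4c")


def _is_consonant(ch: str) -> bool:
    return "\u0a15" <= ch <= "\u0a39" or "\u0a59" <= ch <= "\u0a5e"


def count_violations(text: str) -> int:
    """Count vowel signs and nuktas that do not follow a valid base."""
    bad, prev = 0, ""
    for ch in text:
        if ch in VOWEL_SIGNS and not (_is_consonant(prev) or prev == NUKTA):
            bad += 1
        elif ch == NUKTA and not _is_consonant(prev):
            bad += 1
        prev = ch
    return bad


def gated_apply(text: str, fix: Callable[[str], str]) -> str:
    """Keep the fix only when it makes the text strictly more valid."""
    candidate = fix(text)
    return candidate if count_violations(candidate) < count_violations(text) else text


def blind_swap(text: str) -> str:
    """A naive table entry: always rewrite sihari+ka as ka+sihari."""
    return text.replace("\u0a3f\u0a15", "\u0a15\u0a3f")


broken = "\u0a3f\u0a15\u0a38\u0a3e\u0a28"  # sihari first: one violation
correct = "\u0a1f\u0a3f\u0a15\u0a1f"       # already well-formed

repaired = gated_apply(broken, blind_swap)    # fix accepted
untouched = gated_apply(correct, blind_swap)  # fix rejected, input returned
```

Applied without the gate, `blind_swap` would rewrite the already-correct word as well, which is the failure mode described above.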

vs. doing nothing

Raw OCR often looks right on screen but is malformed Unicode, with code points in the wrong order or marks missing. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
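
The search breakage is easy to reproduce. In this short Python sketch (helper name `code_points` is ours, chosen for the example), a query typed in the correct order fails to match raw OCR output that differs only in code-point order, and Unicode normalisation on its own does not rescue it.

```python
import unicodedata
from typing import List


def code_points(text: str) -> List[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


raw_ocr = "\u0a3f\u0a15\u0a38\u0a3e\u0a28"      # sihari encoded before ka
well_formed = "\u0a15\u0a3f\u0a38\u0a3e\u0a28"  # what a user would type

found_in_raw = well_formed in raw_ocr  # substring search misses the word
found_after_nfc = well_formed in unicodedata.normalize("NFC", raw_ocr)
order = code_points(raw_ocr)           # inspect the stored order directly
```

Printing `code_points` for both strings makes the difference visible even when the two render alike in some fonts.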
