Nukta letters for loanwords
Perso-Arabic loanwords use nukta letters ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼. The nukta is easily dropped or misordered in OCR.
ਭਾਸਾ਼→ਭਾਸ਼ਾ
Punjabi written in the Gurmukhi script: it builds on every Gurmukhi rule.
Punjabi is the everyday language of the Punjab. Its config extends Gurmukhi, inheriting all of those rules and adding Punjabi-specific ones for loanwords and nasalisation.
The two nasalisation marks (ੰ tippi, ਂ bindi) are visually similar and context-dependent.
Because Punjabi uses the Gurmukhi script, every sihari and aspirated-pair issue applies here too.
ਿਕਸਾਨ→ਕਿਸਾਨ

| Raw OCR | gurmukhifix | What happened |
|---|---|---|
| ਿਕਸਾਨ ਵਰਗ | ਕਿਸਾਨ ਵਰਗ | Sihari reordered |
| ਭਾਸਾ਼ ਬੋਲੀ | ਭਾਸ਼ਾ ਬੋਲੀ | Nukta moved before the vowel |
| ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ | ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ | Already correct, untouched |
Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.
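The two ordering rules named above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not gurmukhifix's actual implementation: the function name `reorder_marks`, the regular expressions and the simplified consonant range are all hypothetical.

```python
import re

# Simplified character classes for the Gurmukhi block (an assumption of this sketch).
CONSONANT = "[\u0A15-\u0A39\u0A59-\u0A5C\u0A5E]"  # base consonants, incl. precomposed nukta forms
VOWEL_SIGN = "[\u0A3E-\u0A4C]"                    # dependent vowel signs
SIHARI = "\u0A3F"                                 # drawn left of its consonant, encoded after it
NUKTA = "\u0A3C"                                  # must directly follow the base consonant

# A sihari at the start of a word, followed by a consonant (with optional nukta).
WORD_INITIAL_SIHARI = re.compile(rf"(?<!\S){SIHARI}({CONSONANT}{NUKTA}?)")
# A nukta that ended up after the vowel sign instead of before it.
NUKTA_AFTER_VOWEL = re.compile(rf"({CONSONANT})({VOWEL_SIGN})({NUKTA})")


def reorder_marks(text: str) -> str:
    """Move a word-initial sihari after its consonant and a trailing nukta before the vowel sign."""
    text = WORD_INITIAL_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)
    text = NUKTA_AFTER_VOWEL.sub(r"\1\3\2", text)
    return text


if __name__ == "__main__":
    # The two raw OCR forms from the examples, written as explicit code points.
    for raw in ("\u0A3F\u0A15\u0A38\u0A3E\u0A28", "\u0A2D\u0A3E\u0A38\u0A3E\u0A3C"):
        print(raw, "->", reorder_marks(raw))
```

Text that contains neither pattern, such as ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ, passes through this sketch unchanged.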
A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
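One way to picture evidence gating is a validity check wrapped around each candidate fix. The sketch below is a guess at the general idea, not gurmukhifix's real logic: `count_violations`, `apply_if_better` and the deliberately blind `swap_first_two` rule are hypothetical names, and only two validity rules are checked.

```python
import re
from typing import Callable

CONSONANT = "[\u0A15-\u0A39\u0A59-\u0A5C\u0A5E]"  # simplified consonant range (assumption)
DEPENDENT = "[\u0A3C\u0A3E-\u0A4C]"               # nukta and dependent vowel signs
NUKTA = "\u0A3C"

VIOLATIONS = [
    re.compile(rf"(?<!\S){DEPENDENT}"),         # dependent sign at the start of a word
    re.compile(rf"(?<!{CONSONANT}){NUKTA}"),    # nukta not directly after a consonant
]


def count_violations(text: str) -> int:
    """Count how many times the text breaks one of the validity rules above."""
    return sum(len(pattern.findall(text)) for pattern in VIOLATIONS)


def apply_if_better(text: str, fix: Callable[[str], str]) -> str:
    """Keep the fix only if it strictly reduces the number of violations."""
    candidate = fix(text)
    return candidate if count_violations(candidate) < count_violations(text) else text


def swap_first_two(word: str) -> str:
    """A blind rule: always swap the first two code points of the word."""
    return word[1] + word[0] + word[2:] if len(word) >= 2 else word


if __name__ == "__main__":
    broken = "\u0A3F\u0A15\u0A38\u0A3E\u0A28"  # sihari encoded before its consonant
    fixed = "\u0A15\u0A3F\u0A38\u0A3E\u0A28"   # canonical order
    print(apply_if_better(broken, swap_first_two) == fixed)  # the swap is accepted
    print(apply_if_better(fixed, swap_first_two) == fixed)   # the swap is rejected
```

Applied blindly, `swap_first_two` would corrupt the already-correct word; behind the gate it only fires where it removes a violation.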
Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
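To see why text that looks right can still break search, it helps to print the code points. The following sketch uses only Python's standard `unicodedata` module; the helper `describe` and the sample strings are illustrative assumptions.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


raw = "\u0A3F\u0A15\u0A38\u0A3E\u0A28"    # OCR output: sihari encoded before the consonant
fixed = "\u0A15\u0A3F\u0A38\u0A3E\u0A28"  # canonical order: consonant, then sihari
document = "\u0A15\u0A3F\u0A38\u0A3E\u0A28 \u0A35\u0A30\u0A17"  # correctly encoded text

if __name__ == "__main__":
    print(describe(raw)[:2])
    print(describe(fixed)[:2])
    # Same code points in a different order, so the strings do not match.
    print(raw == fixed)
    # Unicode normalisation does not repair the order: both characters are starters.
    print(unicodedata.normalize("NFC", raw) == raw)
    # A search for the malformed form misses the correctly encoded word.
    print(raw in document, fixed in document)
```

Because normalisation leaves the misordered sequence as it is, the reordering has to come from script-specific rules.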