Punjabi ਪੰਜਾਬੀ

Punjabi written in the Gurmukhi script — it builds on every Gurmukhi rule.

Where it's used

Punjabi is the everyday language of the Punjab. Its config extends Gurmukhi, inheriting all of those rules and adding Punjabi-specific ones for loanwords and nasalisation.

What Tesseract gets wrong

Nukta letters for loanwords

Perso-Arabic loanwords use nukta letters ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼. The nukta is easily dropped or misordered in OCR.

ਭਾਸਾ਼ → ਭਾਸ਼ਾ
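The nukta reordering can be sketched in a few lines of Python. This is only an illustration of the idea, not gurmukhifix's actual implementation: the function name `reorder_nukta` and the choice of a regular expression are assumptions made for this example.

```python
import re

NUKTA = "\u0A3C"
# Dependent vowel signs: U+0A3E..U+0A42, U+0A47, U+0A48, U+0A4B, U+0A4C
VOWEL_SIGNS = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"

# A vowel sign immediately followed by a nukta: the nukta belongs
# directly after the consonant, so the two marks are in the wrong order.
_MISPLACED_NUKTA = re.compile(f"([{VOWEL_SIGNS}])({NUKTA})")


def reorder_nukta(text: str) -> str:
    """Hypothetical helper: move a nukta in front of the vowel sign it trails."""
    return _MISPLACED_NUKTA.sub(lambda m: m.group(2) + m.group(1), text)
```

Text in which the nukta already follows its consonant contains no vowel-sign-plus-nukta pair, so the pattern never matches and the string is returned unchanged.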

Tippi vs bindi nasalisation

The two nasalisation marks, tippi (ੰ) and bindi (ਂ), are visually similar, and which one is correct depends on the vowel sign they follow, so OCR can easily confuse them.
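One way to make that context dependence concrete is a small checker. The sketch below assumes a simplified version of the usual spelling convention (bindi after kanna, bihari, lavan, dulavan, hora and kanaura; tippi after sihari and aunkar) and deliberately ignores every other context, such as bare consonants, independent vowels and dulankar. The function name is hypothetical, and nothing here is taken from gurmukhifix itself.

```python
TIPPI = "\u0A70"
BINDI = "\u0A02"

# Assumption: simplified convention, not a complete orthographic rule set.
BINDI_AFTER = set("\u0A3E\u0A40\u0A47\u0A48\u0A4B\u0A4C")  # kanna, bihari, lavan, dulavan, hora, kanaura
TIPPI_AFTER = set("\u0A3F\u0A41")                          # sihari, aunkar


def suspicious_nasal_marks(text: str) -> list[int]:
    """Return indices of nasal marks that disagree with the simplified convention."""
    flagged = []
    for i in range(1, len(text)):
        prev, mark = text[i - 1], text[i]
        if mark == TIPPI and prev in BINDI_AFTER:
            flagged.append(i)
        elif mark == BINDI and prev in TIPPI_AFTER:
            flagged.append(i)
    return flagged
```

A checker like this only flags candidates for review. Deciding whether to swap the mark would need more context than the preceding character.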

All Gurmukhi failure modes

Because Punjabi uses the Gurmukhi script, every sihari and aspirated-pair issue applies here too.

ਿਕਸਾਨ → ਕਿਸਾਨ
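The sihari case follows the same pattern: the sign is drawn to the left of its consonant, so OCR emits it first, but Unicode expects it after the consonant. The following is a minimal sketch under that assumption. `reorder_sihari` is a hypothetical name, and the real tool may handle more cases than a word-initial sihari.

```python
import re

SIHARI = "\u0A3F"
NUKTA = "\u0A3C"
CONSONANTS = "\u0A15-\u0A39\u0A59-\u0A5C\u0A5E"  # ka..ha plus precomposed nukta letters
GURMUKHI = "\u0A00-\u0A7F"                       # the whole Gurmukhi block

# A sihari that is not preceded by any Gurmukhi character (so it starts a word)
# and is followed by a consonant, optionally carrying a nukta.
_LEADING_SIHARI = re.compile(
    f"(?<![{GURMUKHI}]){SIHARI}([{CONSONANTS}]{NUKTA}?)"
)


def reorder_sihari(text: str) -> str:
    """Hypothetical helper: move a word-initial sihari after its consonant."""
    return _LEADING_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)
```

Because of the lookbehind, a sihari that already sits after a consonant is never matched, so well-formed words pass through unchanged.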

How gurmukhifix fixes it

Examples

Raw OCR | gurmukhifix | What happened
ਿਕਸਾਨ ਵਰਗ | ਕਿਸਾਨ ਵਰਗ | Sihari reordered
ਭਾਸਾ਼ ਬੋਲੀ | ਭਾਸ਼ਾ ਬੋਲੀ | Nukta moved before the vowel
ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ | ਪੰਜਾਬੀ ਮਾਂ ਬੋਲੀ | Already correct, left untouched

Why this beats the alternatives

vs. Tesseract alone

Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.

vs. find-and-replace / spellcheck

A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
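The difference between a blind rule and a gated one can be illustrated with a toy example. Everything in this sketch is an assumption made for illustration: the validity measure (counting dependent signs that start a word), the names `violations`, `apply_if_better` and `naive_swap`, and the deliberately crude swap rule. It does not claim to reproduce how gurmukhifix scores text.

```python
import re

# Dependent signs that may not begin a word: nukta, vowel signs, bindi, tippi, addak.
DEPENDENT_SIGNS = "\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A02\u0A70\u0A71"
_BAD_START = re.compile(f"(?<![\u0A00-\u0A7F])[{DEPENDENT_SIGNS}]")


def violations(text: str) -> int:
    """Toy validity measure: number of words that begin with a dependent sign."""
    return len(_BAD_START.findall(text))


def apply_if_better(text: str, fix) -> str:
    """Keep the fix only if it strictly reduces the number of violations."""
    candidate = fix(text)
    return candidate if violations(candidate) < violations(text) else text


def naive_swap(text: str) -> str:
    """A blind rule: swap every sihari with the character that follows it."""
    return re.sub("\u0A3F(.)", lambda m: m.group(1) + "\u0A3F", text)
```

Applied on its own, `naive_swap` repairs the malformed word but also scrambles a correct one. Wrapped in `apply_if_better`, the same rule is accepted for the malformed input and rejected for the correct input, because it removes no violation there.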

vs. doing nothing

Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
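A short Python check shows why the code-point order matters for search. The two strings below are the raw and corrected forms of the word from the sihari example. The helper `found` is a hypothetical name, and it applies standard NFC normalisation first to show that normalisation does not repair the ordering in this case.

```python
import unicodedata

canonical = "\u0A15\u0A3F\u0A38\u0A3E\u0A28"  # consonant, then sihari
raw_ocr = "\u0A3F\u0A15\u0A38\u0A3E\u0A28"    # sihari emitted before the consonant


def found(query: str, document: str) -> bool:
    """Substring search after NFC normalisation of both sides."""
    def nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)
    return nfc(query) in nfc(document)


print(found(canonical, "... " + canonical + " ..."))  # well-formed text is found
print(found(canonical, "... " + raw_ocr + " ..."))    # raw OCR text is missed
```

A user typing the word correctly would therefore not find it in an index built from the raw OCR output, even though the two forms can look the same on screen.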
