اُ

Urdu اُردُو

Urdu in the Nasta'liq style — a connected, right-to-left script.

Where it's used

Urdu administrative and literary records from north-west India are written in Nasta'liq, a flowing connected script that OCR engines handle poorly. Urdu support is available through the Python package.

What Tesseract gets wrong

Nukta placement changes meaning

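Dots (nuktas) are what distinguish letters that share a base shape: ب has one dot below and پ has three, while ذ is د with one dot above. These are different letters that spell different words, so a misplaced or miscounted nukta corrupts meaning, not just shape.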
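The point about meaning can be checked against Unicode itself. The short sketch below uses only Python's standard `unicodedata` module (it is not part of gurmukhifix) to show that each dotted letter is its own atomic code point, so a nukta error is a letter substitution and not a detached mark. The names `CONFUSABLE_PAIRS` and `describe` are illustrative.

```python
import unicodedata

# Letters that share a base shape and differ only in their dots.
CONFUSABLE_PAIRS = [
    ("\u0628", "\u067E"),  # beh (one dot below) / peh (three dots below)
    ("\u062F", "\u0630"),  # dal (no dot) / thal (one dot above)
]

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for a, b in CONFUSABLE_PAIRS:
    # Neither letter decomposes into "base + dot": the dots are part of
    # the letter's identity, not a separate combining mark.
    assert unicodedata.decomposition(a) == ""
    assert unicodedata.decomposition(b) == ""
    print(describe(a), "|", describe(b))
```

Because the dots are not encoded separately, no normalisation step can repair a wrong letter; only a rule or evidence about the surrounding word can.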

Hamza carrier ambiguity

OCR may emit a hamza as the standalone letter (ء) when it belongs on a carrier (أ, ئ, …); which carrier is correct depends on context.
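To make the carrier idea concrete, the sketch below inspects the canonical decompositions that Unicode defines for two carrier forms. It relies only on the standard `unicodedata` module; `carrier_parts` is a hypothetical helper written for this illustration, not a gurmukhifix function.

```python
import unicodedata

STANDALONE_HAMZA = "\u0621"            # ء, a letter on its own
CARRIER_FORMS = ["\u0623", "\u0626"]   # أ and ئ

def carrier_parts(ch: str) -> list[str]:
    """Split a character into its canonical decomposition (empty list if none)."""
    return [chr(int(cp, 16)) for cp in unicodedata.decomposition(ch).split()]

for form in CARRIER_FORMS:
    base, mark = carrier_parts(form)
    # Each carrier form is a base letter plus U+0654 HAMZA ABOVE.
    print(f"U+{ord(form):04X} = {unicodedata.name(base)} + {unicodedata.name(mark)}")

# The standalone hamza has no decomposition: it is a different character,
# so turning it into a carrier form is a contextual decision, not a normalisation.
print(carrier_parts(STANDALONE_HAMZA))
```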

Connected-letter breaks

Tesseract can break a single connected run of letters into separate pieces, inserting spurious spaces inside a word.

How gurmukhifix fixes it

Applies nukta-placement rules for the common confusable pairs.
Recovers the correct hamza carrier from context.
Optional, evidence-aware rejoining of broken connected letters (off by default so real word spaces are never deleted).
Normalises to NFC.
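The optional rejoining step can be pictured with a small sketch. This is an assumption about what "evidence-aware" could mean, not gurmukhifix's actual implementation: two adjacent fragments are merged only when the merged form is a known word and the fragments are not both known words themselves. The function `rejoin_fragments` and the toy `LEXICON` are hypothetical.

```python
# Toy word list standing in for real evidence (hypothetical).
LEXICON = {"محبت", "کا", "پیغام"}

def rejoin_fragments(text: str, lexicon: set[str]) -> str:
    """Merge adjacent tokens only when the lexicon supports the merge."""
    tokens = text.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            merged = tokens[i] + tokens[i + 1]
            both_known = tokens[i] in lexicon and tokens[i + 1] in lexicon
            # Merge only if the result is a real word and the space was not
            # separating two words that are already valid on their own.
            if merged in lexicon and not both_known:
                out.append(merged)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

word = "محبت"
broken = word[:2] + " " + word[2:] + " کا پیغام"  # simulate a spurious break
print(rejoin_fragments(broken, LEXICON))
```

Leaving such a step off by default is the conservative choice: without evidence, a space between two fragments is indistinguishable from a genuine word boundary.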

Examples

Raw OCR	gurmukhifix	What happened
`اردو زبان`	`اردو زبان`	Valid — passes through
`محبت کا پیغام`	`محبت کا پیغام`	Valid — untouched

Why this beats the alternatives

vs. Tesseract alone

Tesseract turns pixels into characters. It does not check the result against the rules of the script: for Urdu, which dot pattern makes the right letter in context, or which carrier a hamza needs. gurmukhifix adds exactly those rules.

vs. find-and-replace / spellcheck

A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
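The difference between a blind table and an evidence gate fits in a few lines. The sketch below is illustrative only: `apply_if_better`, the toy `score` function, and the two-word lexicon are assumptions made for this example and do not describe gurmukhifix's internals.

```python
from typing import Callable

BEH, PEH = "\u0628", "\u067E"
WATER = PEH + "\u0627\u0646\u06CC"   # peh, alef, noon, yeh
TALK = BEH + "\u0627\u062A"          # beh, alef, teh
LEXICON = {WATER, TALK}

def score(text: str) -> int:
    """Toy validity score: how many tokens are known words."""
    return sum(tok in LEXICON for tok in text.split())

def blind_fix(text: str) -> str:
    """A find-and-replace rule with no notion of context."""
    return text.replace(BEH, PEH)

def apply_if_better(text: str, fix: Callable[[str], str]) -> str:
    """Keep the fix only when it strictly improves the validity score."""
    candidate = fix(text)
    return candidate if score(candidate) > score(text) else text

misread = BEH + "\u0627\u0646\u06CC"              # OCR read beh where peh was written
print(apply_if_better(misread, blind_fix) == WATER)  # fix accepted
print(apply_if_better(TALK, blind_fix) == TALK)      # correct text left alone
print(blind_fix(TALK) == TALK)                       # the blind rule corrupts it
```

The strict comparison matters: a candidate that merely ties with the input is rejected, which is what keeps valid input unchanged.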

vs. doing nothing

Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
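A minimal illustration of how text that looks right can still break search, using only Python's standard `unicodedata` module. The two strings below render the same way but are different code-point sequences until they are normalised to NFC. The helper name `same_after_nfc` is illustrative.

```python
import unicodedata

decomposed = "\u0627\u0654"   # alef followed by combining hamza above
composed = "\u0623"           # precomposed alef with hamza above

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A naive comparison, and therefore a naive search index, treats them as different.
print(decomposed == composed)
# After NFC both collapse to the same single code point.
print(same_after_nfc(decomposed, composed))
```

NFC addresses only canonically equivalent sequences; wrong letters or dropped marks still need the rule-based fixes described earlier.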

Use via the Python package →