Yeh variants
Persian ye (ی) is frequently encoded as Arabic yaa (ي) or alef maqsura (ى), breaking search.
Persian (Farsi) — Arabic-script with Persian-specific letters.
Persian was the court and administrative language across much of north-west India for centuries. Its records mix Persian letters that OCR often maps to the wrong Arabic forms. Available through the Python package.
Persian ye (ی) is frequently encoded as Arabic yaa (ي) or alef maqsura (ى), breaking search.
ک and گ differ by a small stroke and are routinely swapped.
پ چ ژ گ are often misread as their nukta-less Arabic equivalents.
| Raw OCR | gurmukhifix | What happened |
|---|---|---|
زبان فارسی | زبان فارسی | Valid — passes through |
کتاب مطالعه | کتاب مطالعه | Valid — untouched |
Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.
A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.