فا

Farsi فارسی

Persian (Farsi) is written in the Arabic script, extended with Persian-specific letters.

Where it's used

Persian was the court and administrative language across much of north-west India for centuries. Records from that period contain Persian-specific letters that OCR often maps to the wrong Arabic forms. Persian support is available through the Python package.

What Tesseract gets wrong

Yeh variants

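Persian ye (ی, U+06CC) is frequently encoded as Arabic yaa (ي, U+064A) or alef maqsura (ى, U+0649). The glyphs look alike, but the code points differ, so a search for the Persian spelling no longer matches.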
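A small sketch of why this matters, using only the Python standard library. The strings are written with escapes so the code points are unambiguous; the sample word and variable names are illustrative and are not taken from gurmukhifix.

```python
import unicodedata

PERSIAN_YE = "\u06cc"    # ی
ARABIC_YAA = "\u064a"    # ي
ALEF_MAQSURA = "\u0649"  # ى

# The three letters look alike in many fonts but are distinct code points.
for ch in (PERSIAN_YE, ARABIC_YAA, ALEF_MAQSURA):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

stem = "\u0641\u0627\u0631\u0633"  # فارس, the word فارسی without its final letter

query = stem + PERSIAN_YE      # what a Persian reader types
ocr_word = stem + ARABIC_YAA   # what OCR may emit for the same word

# A plain substring search fails on the OCR spelling...
found_raw = query in ocr_word

# ...and succeeds once the yaa is replaced by the Persian ye.
found_fixed = query in ocr_word.replace(ARABIC_YAA, PERSIAN_YE)
```

Here `found_raw` is `False` and `found_fixed` is `True`, even though both spellings render almost identically on screen.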

Kaf / gaf confusion

ک and گ differ by a small stroke and are routinely swapped.

Persian letters read as Arabic

پ چ ژ گ are often misread as the Arabic letters they extend (ب ج ز ك), which have fewer dots or lack the extra stroke.

How gurmukhifix fixes it

Canonicalises yeh and kaf/gaf variants to their Persian forms.
Restores Persian-specific letters (پ چ ژ گ).
Optional, evidence-aware joining repair for broken glyphs.
Normalises to NFC.
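
The first and last steps above can be approximated in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not gurmukhifix's implementation: the mapping table and the function name `canonicalise_persian` are hypothetical, and the mapping is applied unconditionally, which is only reasonable for text already known to be Persian. Restoring پ چ ژ گ and repairing joining need context and are not attempted here.

```python
import unicodedata

# Hypothetical table for illustration only (not gurmukhifix's actual data).
# Keys and values are code points, as str.translate expects.
PERSIAN_CANONICAL = {
    0x064A: 0x06CC,  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    0x0649: 0x06CC,  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    0x0643: 0x06A9,  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
}


def canonicalise_persian(text: str) -> str:
    """Map Arabic yeh/kaf look-alikes to Persian forms, then normalise to NFC."""
    return unicodedata.normalize("NFC", text.translate(PERSIAN_CANONICAL))


# کتاب spelled with an Arabic kaf (U+0643), as OCR might emit it.
raw = "\u0643\u062a\u0627\u0628"
fixed = canonicalise_persian(raw)

# Text that already uses the Persian forms comes back unchanged.
already_ok = "\u06a9\u062a\u0627\u0628"
unchanged = canonicalise_persian(already_ok)
```

After this runs, `fixed` begins with U+06A9 and `unchanged` equals `already_ok`.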

Examples

Raw OCR	gurmukhifix	What happened
`زبان فارسی`	`زبان فارسی`	Valid — passes through
`کتاب مطالعه`	`کتاب مطالعه`	Valid — untouched

Why this beats the alternatives

vs. Tesseract alone

Tesseract turns pixels into characters. It has no linguistic knowledge: it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. For Persian, it equally can't know that a word should be spelled with ی rather than ي. gurmukhifix adds exactly those rules.

vs. find-and-replace / spellcheck

A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
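
The idea of an evidence gate can be illustrated with a toy example. Everything here is an assumption made for illustration: the `validity` score, the `apply_if_better` helper and the look-alike set are hypothetical and much cruder than whatever gurmukhifix uses internally. The point is only the shape of the check: a candidate fix is kept when it scores strictly better than the input, and discarded otherwise.

```python
from typing import Callable

# Hypothetical set of Arabic letters that should not appear in Persian text.
ARABIC_LOOKALIKES = {"\u064a", "\u0649", "\u0643"}


def validity(text: str) -> int:
    """Toy score (an assumption): fewer Arabic look-alikes means a higher score."""
    return -sum(ch in ARABIC_LOOKALIKES for ch in text)


def apply_if_better(
    text: str,
    fix: Callable[[str], str],
    score: Callable[[str], int] = validity,
) -> str:
    """Return fix(text) only if it scores strictly higher than text."""
    candidate = fix(text)
    return candidate if score(candidate) > score(text) else text


def to_persian_ye(text: str) -> str:
    return text.replace("\u064a", "\u06cc")


def to_arabic_yaa(text: str) -> str:
    # A deliberately harmful "fix", to show the gate rejecting it.
    return text.replace("\u06cc", "\u064a")


good = "\u0641\u0627\u0631\u0633\u06cc"  # فارسی with Persian ye
bad = "\u0641\u0627\u0631\u0633\u064a"   # same word with Arabic yaa

repaired = apply_if_better(bad, to_persian_ye)     # accepted: score improves
untouched = apply_if_better(good, to_persian_ye)   # no change needed
protected = apply_if_better(good, to_arabic_yaa)   # rejected: score gets worse
```

With this gate, `repaired` equals `good`, and both `untouched` and `protected` are returned exactly as they came in.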

vs. doing nothing

Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
