Same structural matra rules
Every Devanagari language shares the consonant + matra structure, so the same orphaned-matra checks apply.
ाक → flagged

The shared base script behind Hindi, Marathi, Nepali and Sanskrit.
Use --lang devanagari for mixed or non-Hindi Devanagari material — Marathi, Nepali, Sanskrit. It inherits the Hindi rules and adds script-general ones.
Sanskrit scans pick up spurious udatta/anudatta accent marks from speckle.
An avagraha (ऽ) in OCR output is often a misread danda (।) or stray mark.
| Raw OCR | gurmukhifix | What happened |
|---|---|---|
भारत देश | भारत देश | Already correct — untouched |
मराठी भाषा | मराठी भाषा | Valid Marathi — passes through |
ाक | ाक ⚑ | Vowel sign at word start — flagged |
Tesseract turns pixels into characters. It has no linguistic knowledge — it can't know that a dependent vowel may not begin a word, or that a sign written to the left of a letter must be encoded after it. gurmukhifix adds exactly those rules.
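The first of those rules, that a dependent vowel may not begin a word, can be sketched in a few lines of Python. This is an illustration only, not gurmukhifix's actual code: the function name `orphaned_matra_words` and the choice of code-point ranges are assumptions made for the example.

```python
# Illustrative check only; not gurmukhifix's actual implementation.
# Devanagari dependent vowel signs (matras): U+093E..U+094C, plus U+0962 and U+0963.
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)} | {"\u0962", "\u0963"}

def orphaned_matra_words(text):
    """Return the whitespace-separated words whose first character is a matra."""
    return [word for word in text.split() if word[0] in MATRAS]

# The last sample is vowel sign AA (U+093E) followed by KA (U+0915).
for sample in ["भारत देश", "मराठी भाषा", "\u093E\u0915"]:
    status = "flagged" if orphaned_matra_words(sample) else "ok"
    print(sample, "->", status)
```

A character-level OCR engine has no such word-level view, which is why this kind of check has to run as a separate pass over its output.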
A blind substitution table rewrites correct letters too and corrupts good text. gurmukhifix is evidence-gated: a fix is applied only when it makes the text more valid, so already-correct Unicode is never changed.
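One way to picture evidence gating is as a compare-before-commit step: score the text, apply the candidate fix, and keep the result only if the score improves. The sketch below assumes a deliberately simplified validity model (a matra must directly follow a consonant or nukta) and a hypothetical rule `blind_swap`; neither is taken from gurmukhifix itself.

```python
import re

# Simplified validity model (an assumption for illustration): a matra is valid
# only directly after a consonant (U+0915..U+0939) or a nukta (U+093C).
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}
VALID_BEFORE_MATRA = {chr(cp) for cp in range(0x0915, 0x093A)} | {"\u093C"}

def violations(text):
    """Count matras that do not directly follow a consonant or nukta."""
    count = 0
    previous = ""
    for ch in text:
        if ch in MATRAS and previous not in VALID_BEFORE_MATRA:
            count += 1
        previous = ch
    return count

def blind_swap(text):
    """Hypothetical blind rule: move every vowel sign I (U+093F) after the next consonant."""
    return re.sub("\u093F([\u0915-\u0939])", lambda m: m.group(1) + "\u093F", text)

def gated_fix(text, fix):
    """Keep the result of `fix` only if it strictly reduces the violation count."""
    candidate = fix(text)
    return candidate if violations(candidate) < violations(text) else text

misordered = "\u093F\u0915"              # vowel sign I encoded before KA
correct = "\u0915\u093F\u0924\u093E\u092C"  # a well-formed word (kitab)

print(gated_fix(misordered, blind_swap) == "\u0915\u093F")  # fix accepted
print(blind_swap(correct) == correct)                       # blind rule alters good text
print(gated_fix(correct, blind_swap) == correct)            # gate rejects that change
```

Requiring a strict improvement is what protects already-correct text: a rewrite that leaves the score unchanged or makes it worse is discarded.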
Raw OCR often looks right but is malformed Unicode — wrong code-point order, dropped marks. That silently breaks search, indexing, fonts and copy-paste. gurmukhifix produces canonical, well-formed text.
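A small Python example shows how text that looks identical can still defeat search. It uses standard Unicode normalization from the `unicodedata` module to compare two encodings of the same letter; this covers only one aspect of well-formedness, and whether gurmukhifix uses this normalization form internally is not stated here.

```python
import unicodedata

precomposed = "\u0958"       # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093C"  # KA followed by NUKTA; renders the same

def canonical(text):
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)

print(precomposed == decomposed)                        # False: raw code points differ
print(canonical(precomposed) == canonical(decomposed))  # True: same canonical form
```

A search index built on raw code points would treat these two spellings as different words, which is the kind of silent breakage described above.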