This property determines the language for OCR.
Each language has its own set of characters and words, and
Tesseract is trained on that set so that it knows what kind of
content to expect. Tesseract trained on English therefore expects
different characters and words than Tesseract trained on German, so
the language should match the language of the document.
The language property is a short string code that corresponds
directly to a training file in the tessdata folder. For example, if
the language is "eng" then Tesseract will look for files such as
"eng.traineddata" in the tessdata folder. The language codes are
generally the three-letter codes from the ISO 639 standard. For
example:
- eng - English
- chi_sim - Simplified Chinese (Mainland China)
- chi_tra - Traditional Chinese (Hong Kong and Taiwan)
- deu - German
- fra - French
- heb - Hebrew
- ita - Italian
- jpn - Japanese
- kor - Korean
- pol - Polish
- por - Portuguese
- rus - Russian
- spa - Spanish
- tha - Thai
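
The mapping from a language code to its training file can be sketched
as follows. This is a minimal illustration in Python, not part of the
ABCocr API: the helper names traineddata_path and
is_language_installed are hypothetical, and the only convention
assumed is the "<code>.traineddata" file name described above.

```python
from pathlib import Path


def traineddata_path(tessdata_dir, language):
    """Return the path where the training file for `language` is expected.

    Hypothetical helper: it only applies the "<code>.traineddata"
    naming convention and does not call any OCR library.
    """
    return Path(tessdata_dir) / f"{language}.traineddata"


def is_language_installed(tessdata_dir, language):
    """Return True if the training file for `language` exists on disk."""
    return traineddata_path(tessdata_dir, language).is_file()
```

Checking for the file before setting the language property makes a
missing training file easier to diagnose than a failure reported at
recognition time.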
ABCocr ships with trained data for many languages; see the tessdata
folder for details. However, to save space we do not include the
Chinese, Japanese, Korean, Thai, Vietnamese, Arabic and Hindi
trained data. If you need these files, you can download them from
the Tesseract web site.
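
To see which languages are present in a given tessdata folder, and
which still need to be downloaded, something like the following
sketch could be used. It is an illustration in Python rather than
ABCocr code: the function names are hypothetical, and it assumes only
that each language is stored as "<code>.traineddata".

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in the folder."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(tessdata_dir, wanted):
    """Return the codes in `wanted` that have no training file, in order."""
    have = set(installed_languages(tessdata_dir))
    return [code for code in wanted if code not in have]
```

For example, missing_languages("tessdata", ["eng", "jpn"]) would
return ["jpn"] if only the English training file were present.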
If you wish, you can create your own training files for your own
languages and text. See the Tesseract web site for details of how
to do this.