ABCocr .NET OCR - Language Property

Language Property

Type	Default	Read Only	Description
[C#] `string` [Visual Basic] `String`	"eng"	No	The language that text is expected to be in.

Notes

This property determines the language used for OCR.

Each language has its own set of characters and words. Tesseract is trained on that set so that it knows what kind of content to expect. A document read with English training is therefore interpreted differently from the same document read with German training.

The Language property is a short code that corresponds directly to the training files in the tessdata folder. For example, if the language is "eng", Tesseract looks for files such as "eng.traineddata" in the tessdata folder. The language codes are generally taken from the ISO 639 standard. Some common codes are:

eng - English
chi_sim - Simplified Chinese (Mainland China)
chi_tra - Traditional Chinese (Hong Kong and Taiwan)
deu - German
fra - French
heb - Hebrew
ita - Italian
jpn - Japanese
kor - Korean
pol - Polish
por - Portuguese
rus - Russian
spa - Spanish
tha - Thai
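
The relationship between a language code and its training file can be sketched in C# as follows. This is a minimal illustration only: the `TrainedData` class and its methods are hypothetical helpers, not part of ABCocr, and the sketch assumes that a single "<code>.traineddata" file in the tessdata folder is enough to tell whether a language is available.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the ABCocr API.
public static class TrainedData
{
    // Builds the expected path of the training file for a language code,
    // for example the folder "tessdata" and the code "eng" give
    // "eng.traineddata" inside that folder.
    public static string PathFor(string tessdataFolder, string language)
    {
        if (string.IsNullOrWhiteSpace(language))
            throw new ArgumentException("Language code must not be empty.", nameof(language));
        return Path.Combine(tessdataFolder, language + ".traineddata");
    }

    // Returns true if the expected training file exists on disk.
    public static bool IsInstalled(string tessdataFolder, string language)
    {
        return File.Exists(PathFor(tessdataFolder, language));
    }
}
```

A check such as `TrainedData.IsInstalled(folder, "deu")`, made before the Language property is set to "deu", is one way to report a missing training file early.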

ABCocr ships with training files for many languages; see the tessdata folder for details. However, to save space, we do not include the Chinese, Japanese, Korean, Thai, Vietnamese, Arabic and Hindi trained data. If you need these files, you can download them from the Tesseract web site.
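
Because the available languages depend on what is in the tessdata folder, it can be useful to list them at run time. The sketch below is one possible approach, not an ABCocr feature: the `InstalledLanguages` class is a hypothetical helper, and it assumes that every "*.traineddata" file name in the folder corresponds to one usable language code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of the ABCocr API.
public static class InstalledLanguages
{
    // Returns the language codes found in the given tessdata folder,
    // derived from the names of the "*.traineddata" files it contains.
    // A missing folder yields an empty list.
    public static List<string> Codes(string tessdataFolder)
    {
        if (!Directory.Exists(tessdataFolder))
            return new List<string>();

        return Directory.EnumerateFiles(tessdataFolder, "*.traineddata")
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .OrderBy(code => code, StringComparer.Ordinal)
            .ToList();
    }
}
```

If a code such as "jpn" is absent from the returned list, the corresponding file has not been installed and would need to be downloaded and copied into the folder first.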

You can also create your own training files for other languages or kinds of text. See the Tesseract web site for details of how to do this.

Example

None.