Type: string [C#], String [Visual Basic]
Default: "eng"
Read Only: No
Description: The language that text is expected to be in.

 

   

Notes
 

This property determines the language for OCR.

Each language has its own set of characters and words. These are used to train Tesseract on the kinds of content it is likely to encounter. Tesseract working from English training data will therefore expect different content than it would when working from German training data.

The Language property is a language code that maps directly to training files in the tessdata folder. For example, if the language is "eng", Tesseract will look for files such as "eng.traineddata" in the tessdata folder. The language codes are generally taken from the ISO 639 standard. For example:

  • eng - English
  • chi_sim - Simplified Chinese (Mainland China)
  • chi_tra - Traditional Chinese (Hong Kong and Taiwan)
  • deu - German
  • fra - French
  • heb - Hebrew
  • ita - Italian
  • jpn - Japanese
  • kor - Korean
  • pol - Polish
  • por - Portuguese
  • rus - Russian
  • spa - Spanish
  • tha - Thai
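
The mapping from a language code to a training file can be sketched as below. This is only an illustration, not part of ABCocr: the class and method names (TessdataLookup, TrainedDataPath, IsLanguageInstalled) are hypothetical, and the sketch assumes each language is represented by a single "<code>.traineddata" file in a tessdata folder whose path you supply.

```csharp
using System;
using System.IO;

// Hypothetical helper, not an ABCocr class.
public static class TessdataLookup
{
    // Builds the expected training-file path for a language code,
    // e.g. "eng" -> <tessdataFolder>/eng.traineddata.
    public static string TrainedDataPath(string tessdataFolder, string language)
    {
        if (string.IsNullOrWhiteSpace(language))
            throw new ArgumentException("Language code must not be empty.", nameof(language));
        return Path.Combine(tessdataFolder, language + ".traineddata");
    }

    // Returns true when the training file for the given code is present.
    public static bool IsLanguageInstalled(string tessdataFolder, string language)
    {
        return File.Exists(TrainedDataPath(tessdataFolder, language));
    }
}
```

A check like IsLanguageInstalled(folder, "deu") could be run before setting the language to "deu", so that a missing training file is reported early rather than at recognition time.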

ABCocr ships with trained data for many languages; see the tessdata folder for details. However, to save space, we do not include the Chinese, Japanese, Korean, Thai, Vietnamese, Arabic and Hindi trained data. If you need these files, you can download them from the Tesseract web site.
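
To see which languages are actually present, the tessdata folder can be scanned for training files. The following is a rough sketch under the same assumption of one "<code>.traineddata" file per language; TessdataInventory and InstalledLanguages are hypothetical names, and the folder path is whatever location your installation uses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not an ABCocr class.
public static class TessdataInventory
{
    // Lists the language codes that have a .traineddata file in the folder,
    // sorted by code. Returns an empty list if the folder does not exist.
    public static List<string> InstalledLanguages(string tessdataFolder)
    {
        if (!Directory.Exists(tessdataFolder))
            return new List<string>();

        return Directory.EnumerateFiles(tessdataFolder, "*.traineddata")
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .OrderBy(code => code, StringComparer.Ordinal)
            .ToList();
    }
}
```

If a code such as "jpn" or "tha" is missing from the returned list, the corresponding trained data would need to be downloaded and copied into the folder first.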

If you wish, you can create your own training files for your own languages and text. See the Tesseract web site for details of how to do this.

 

   

Example
 

None.