What is ABCocr?
ABCocr is a .NET Optical Character Recognition (OCR) product. You use ABCocr .NET to extract text from images.
ABCocr .NET is built on industry-standard OCR software. At its heart is a custom version of the Tesseract 3 OCR engine.
The Tesseract OCR engine was originally developed by Hewlett-Packard UK. It was one of the top three engines in the 1995 UNLV Accuracy Test and is probably one of the most accurate open-source OCR engines available. Since then it has been extensively revised, with sponsorship from Google.
Tesseract supports English, Spanish, German, French, Italian, Portuguese, Arabic, Bulgarian, Catalan, Chinese (simplified), Chinese (traditional), Croatian, Czech, Danish (Fraktur script), Danish (standard), Dutch, Finnish, Greek, Hebrew, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Romanian, Russian, Serbian, Slovak (Fraktur script), Slovak (standard), Slovenian, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. Tesseract can be trained to work in other languages as well.
So why wouldn't I just use Tesseract? What does ABCocr .NET add?
- 100% Stable. The original Tesseract was designed to run as a command-line process, where it matters little if it occasionally terminates, crashes or leaks memory, because the process exits after each run. In a modern in-process application you absolutely cannot have this type of behavior. ABCocr resolves these issues and presents you with a 100% stable platform.
- 100% Performant. Because Tesseract was designed around a command-line process, it was not built for multithreaded use. ABCocr adds multithreading support, so you can spread the load over multiple CPUs or cores and use it safely from multithreaded environments such as ASP.NET.
- 100% Compatible. Tesseract is a 32-bit process and cannot be used in 64-bit applications. This is a significant issue now that so many operating systems are based around a 64-bit address space. ABCocr eliminates this restriction and allows you to run in either x86 or x64 mode completely automatically.
- 100% Consistent. Tesseract is somewhat idiosyncratic. If you've ever seen error messages telling you that your TIFF tags are in the wrong order, you will know what we mean. ABCocr eliminates this idiosyncrasy and provides a simple and uniform way of dealing with OCR.
- 100% Simple. We only have one example. Why is this? Well, because it's so simple to use that we couldn't think of anything else you would need.
In terms of class structure, there is an OCR class that provides methods for assigning the images to be processed.
The results come back as a Page object which contains a list of Word objects.
Each Word object contains a list of Blob objects which, broadly speaking, correspond to characters.
All these objects have text and bounds so that you can work out where they are in the image.
It really is as simple as that!
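To make the Page, Word and Blob hierarchy concrete, here is a minimal, language-neutral sketch written in Python. It is not the ABCocr .NET API: every class, field and function name below (Bounds, Blob, Word, Page, words_in_region, build_sample_page) is an assumption made for illustration, and the sample page is hand-built rather than produced by any OCR engine. It only shows how text and bounds at each level might be used to locate words in an image.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Bounds:
    """Axis-aligned rectangle in image pixel coordinates (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height

    def intersects(self, other: "Bounds") -> bool:
        # True when the two rectangles share at least one pixel of area.
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass
class Blob:
    """Roughly one character, with its text and position."""
    text: str
    bounds: Bounds


@dataclass
class Word:
    """A word with its text, position, and the blobs it is made of."""
    text: str
    bounds: Bounds
    blobs: List[Blob] = field(default_factory=list)


@dataclass
class Page:
    """The result for one image: its text, extent, and list of words."""
    text: str
    bounds: Bounds
    words: List[Word] = field(default_factory=list)


def words_in_region(page: Page, region: Bounds) -> List[str]:
    """Return the text of every word whose bounds overlap the region."""
    return [w.text for w in page.words if w.bounds.intersects(region)]


def build_sample_page() -> Page:
    """Hand-built stand-in for an OCR result (not real engine output)."""
    hi = Word("Hi", Bounds(10, 10, 30, 20), [
        Blob("H", Bounds(10, 10, 16, 20)),
        Blob("i", Bounds(28, 10, 12, 20)),
    ])
    ok = Word("OK", Bounds(60, 10, 40, 20), [
        Blob("O", Bounds(60, 10, 20, 20)),
        Blob("K", Bounds(82, 10, 18, 20)),
    ])
    return Page("Hi OK", Bounds(0, 0, 200, 100), [hi, ok])


if __name__ == "__main__":
    page = build_sample_page()
    for word in page.words:
        print(word.text, word.bounds)
        for blob in word.blobs:
            print("  ", blob.text, blob.bounds)
```

Because every level carries both text and bounds, a helper like words_in_region can answer questions such as "which words fall inside this part of the image?" without touching the pixels again.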