Saturday, September 5, 2015

Researching an OCR (Optical Character Recognition) library

Recently I got a project request to write a program that automatically fills in and submits a web form like the following:



I want to build a desktop program so that it is easy for the user to run the automatic submission. I chose the .NET Framework as the platform.

As you can see from the image above, a captcha image is generated every time the page is accessed, so the natural solution is an OCR library. In my experience, Tesseract is one of the best options available. However, it is difficult to use directly from .NET because it is written in C/C++.

So is there a convenient wrapper that makes the library easier to use from .NET? Yes, there are several. I am not going to go through all of them, only the one I tested and got working:
https://github.com/charlesw/tesseract/tree/release/2.4.0

The GitHub project gives these instructions:

1. Add the Tesseract NuGet Package by running Install-Package Tesseract from the Package Manager Console.
2. Ensure you have Visual Studio 2012 x86 & x64 runtimes installed (see note above).
3. Download language data files for tesseract 3.02 from tesseract-ocr and add them to your project, ensure 'Copy to output directory' is set to Always.
4. Check out the Samples solution ~/Samples/Tesseract.Samples.sln for a working example

These instructions are correct, I followed all of them, and I really appreciate the authors' contributions. I opened the solution file in SharpDevelop 4.4 and tried the Tesseract.ConsoleDemo project, which is a console program. But something was missing when I tried to build it: the reference to "Tesseract.dll" could not be found!

Why? There had to be something I had missed, and indeed I had missed a big chunk. Tesseract is a C/C++ library, and I was stuck in .NET mode at that moment. I can't expect it to be cross-platform like .NET or Java, because compiled C/C++ code is platform dependent. I have to build my own DLL, stupid me :D

Knowing what was missing was a major help, and it was not difficult to spot "build.bat" right there at the top of the project. So I launched my favourite command prompt, gnuwin32, and ran "build.bat". Then came another difficulty: the build failed with the following message:

Text Output 

Project "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\build.proj" on node 1 (default targets).
PrepareBuild:
  Copying file from "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\src\AssemblyVersionInfo.template.cs" to "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\src\AssemblyVersionInfo.cs".
  copy /y "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\src\AssemblyVersionInfo.template.cs" "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\src\AssemblyVersionInfo.cs"
C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\build.proj(55,9): error MSB4062: The "MSBuild.ExtensionPack.FileSystem.File" task could not be loaded from the assembly C:\Projects\Temp\Http_Submit%28OCR%29\tesseract-release-2.4.0\tesseract-release-2.4.0\tools\MSBuild.ExtensionPack\MSBuild.ExtensionPack.dll. Could not load file or assembly 'file:///C:\Projects\Temp\Http_Submit%28OCR%29\tesseract-release-2.4.0\tesseract-release-2.4.0\tools\MSBuild.ExtensionPack\MSBuild.ExtensionPack.dll' or one of its dependencies. The system cannot find the file specified. Confirm that the <UsingTask> declaration is correct, that the assembly and all its dependencies are available, and that the task contains a public class that implements Microsoft.Build.Framework.ITask.
Done Building Project "C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\build.proj" (default targets) -- FAILED.


Build FAILED.

"C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\build.proj" (default target) (1) ->
(PrepareBuild target) ->
  C:\Projects\Temp\Http_Submit(OCR)\tesseract-release-2.4.0\tesseract-release-2.4.0\build.proj(55,9): error MSB4062: The "MSBuild.ExtensionPack.FileSystem.File" task could not be loaded from the assembly C:\Projects\Temp\Http_Submit%28OCR%29\tesseract-release-2.4.0\tesseract-release-2.4.0\tools\MSBuild.ExtensionPack\MSBuild.ExtensionPack.dll. Could not load file or assembly 'file:///C:\Projects\Temp\Http_Submit%28OCR%29\tesseract-release-2.4.0\tesseract-release-2.4.0\tools\MSBuild.ExtensionPack\MSBuild.ExtensionPack.dll' or one of its dependencies. The system cannot find the file specified. Confirm that the <UsingTask> declaration is correct, that the assembly and all its dependencies are available, and that the task contains a public class that implements Microsoft.Build.Framework.ITask.

    0 Warning(s)
    1 Error(s)

Image Screenshot

I tried to find the reason for the failure and failed countless times. Just when it seemed too big a problem for me to solve, thanks to this person, I found the answer here: a parentheses-escaping problem. Ahhh ... finally it makes sense. In the log above, the folder "Http_Submit(OCR)" shows up as "Http_Submit%28OCR%29" in the path of the assembly that could not be loaded, so the escaped path did not match the real folder. Renaming my project folder from "Http_Submit(OCR)" to "Http_Submit_OCR" solved all the compile errors. Now it compiles smoothly, as shown below:


If you have read this far and are still reading, I bet you are either a programmer or you have a lot of free time :) . So the next step is to reference the correct "Tesseract.dll" in the project.

After the configuration, this was my first attempt with the original captcha images (250 x 30 pixels):



Not bad for a first attempt, but there are still many recognition errors.
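For anyone who wants to reproduce a first attempt like this, here is a minimal sketch of what the recognition code might look like. It is not my exact code. It assumes the wrapper's TesseractEngine, Pix.LoadFromFile, Process and GetText calls as used in the project's samples, a "tessdata" folder with the English language data copied next to the executable, and a hypothetical file name "captcha.png".

```csharp
using System;
using Tesseract;

class CaptchaOcrDemo
{
    static void Main()
    {
        // Assumption: eng.traineddata sits in ./tessdata next to the executable.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        // Hypothetical file name; use the path of your own captcha image.
        using (var img = Pix.LoadFromFile("captcha.png"))
        using (var page = engine.Process(img))
        {
            // Recognised text, with surrounding whitespace removed.
            Console.WriteLine("Text: {0}", page.GetText().Trim());
            // Mean confidence reported by the engine for this page.
            Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());
        }
    }
}
```

The engine, the image and the page all wrap native resources, which is why each of them is placed in a using block.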

Then, in the second attempt, I added:

engine.SetVariable("tessedit_char_whitelist", "?.0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"); 
and got the following results:



Yes, there is some improvement with the special characters eliminated. Finally, to increase accuracy, I added code to enlarge the image, and got better results:


Better again, but there are still some errors, so there is still room for improvement. Any suggestions for further improvement are most welcome :)
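For readers who want to try the enlarging step themselves, here is a minimal sketch of one way to do it with System.Drawing. The class name CaptchaImage, the method name Enlarge and the choice of bicubic interpolation are assumptions for illustration, not necessarily what my program uses.

```csharp
using System.Drawing;
using System.Drawing.Drawing2D;

static class CaptchaImage
{
    // Returns a new bitmap whose width and height are the source's
    // multiplied by 'factor'. The caller is responsible for disposing it.
    public static Bitmap Enlarge(Image source, int factor)
    {
        var result = new Bitmap(source.Width * factor, source.Height * factor);
        using (var g = Graphics.FromImage(result))
        {
            // Bicubic interpolation gives smoother character edges than the default.
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(source, 0, 0, result.Width, result.Height);
        }
        return result;
    }
}
```

With a factor of 3 (an arbitrary example value), a 250 x 30 captcha becomes 750 x 90. The enlarged bitmap can then be saved to a file and loaded for recognition in the usual way. Some versions of the wrapper can also process a Bitmap directly, but check that against the version you built.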


PS: I was running Windows 7 when doing this research.

You can always contact me via my website: www.weeprogramming.com