I've been playing around with OCR software lately: Tesseract, gOCR, and Ocropus. I'd like to get all the developers together in a room and lock them in until they come out with something awesome. Each program has features that I'd like to see in a combined package, but for now I'll work with what I have.
Anyway, this post is a bit of a tangent to the whole goal of OCR, which is recognising text. Thinking about how to make the job of an OCR program harder can lead to a deeper understanding of the recognition process. The leading technology for beating OCR is the captcha: those annoying little blurred words you have to read to gain access to forums and other sites. They're there to prove you're a human and not a spam bot. Through a combination of geometric distortions and filters, a captcha makes text hard for computers to read but still legible to humans.
For the hell of it, I thought it would be nice to be able to generate them from the command line, so here's what I came up with. You'll need ImageMagick installed.
I've put everything together in a script located here.
Plain text is generated first.
convert -background white -fill black -font FreeSerif-Bold -pointsize 36 label:'all work and\nno play\nmakes Grant\na dull boy' test.png
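A real captcha would use text that changes every time rather than a fixed phrase. Here is a rough sketch of one way to get some, assuming /dev/urandom is available; random_text is a made-up helper name, not part of ImageMagick.

```shell
# random_text is a hypothetical helper: print N random lowercase letters (default 6).
# Assumes /dev/urandom exists and that N is small (a few dozen at most).
random_text() {
    n=${1:-6}
    # 1024 random bytes yield roughly 100 lowercase letters, plenty for a short captcha.
    # LC_ALL=C makes tr treat the input as raw bytes.
    letters=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'a-z')
    # Keep only the first N letters.
    printf '%s\n' "$letters" | cut -c "1-$n"
}
```

The result can be dropped into the label, for example label:"$(random_text 6)" in place of the quoted phrase above.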
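A wave distortion is applied next. It shifts the columns of the image along a sine curve; 4x55 means an amplitude of 4 pixels and a wavelength of 55 pixels.
convert test.png -background white -wave 4x55 test2.png
A light blur (sigma of 1) then softens the edges.
convert test2.png -blur 0x1 test3.png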
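Since the whole point is to trip up OCR, it can be instructive to run an OCR program over each intermediate image and watch where recognition starts to fall apart. The sketch below does that with Tesseract. It assumes a reasonably recent Tesseract that accepts "stdout" as the output base, and ocr_check is a made-up helper name.

```shell
# ocr_check is a hypothetical helper: print what Tesseract reads from each image given.
ocr_check() {
    for img in "$@"; do
        printf '== %s ==\n' "$img"
        # "stdout" as the output base prints the recognised text instead of writing a .txt file.
        # Tesseract's own status messages go to stderr and are discarded here.
        tesseract "$img" stdout 2>/dev/null
    done
}
```

With the files generated so far, that would be: ocr_check test.png test2.png test3.png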
Finally, a photocopy effect is added. The filter was found at www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=14441&start=0
convert test3.png -colorspace gray -contrast-stretch 4%x0% \( +clone -blur 0x3 \) +swap -compose divide -composite -blur 0x1 -unsharp 0x20 test4.png
[Image: photocopy effect added]
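The separate steps can also be chained into one ImageMagick command, which avoids the intermediate files. This is only a sketch of how that might look, using the same options as above. It assumes convert is on the PATH, and make_captcha is a made-up helper name.

```shell
# make_captcha is a hypothetical helper: render TEXT and write the distorted image to OUT.
make_captcha() {
    text=$1   # text to render; a literal \n starts a new line, as in the label: step
    out=$2    # output image file
    # Render the label, then wave, blur, and the photocopy filter, all in one pass.
    convert -background white -fill black -font FreeSerif-Bold -pointsize 36 \
        "label:$text" \
        -wave 4x55 \
        -blur 0x1 \
        -colorspace gray -contrast-stretch 4%x0% \
        \( +clone -blur 0x3 \) +swap -compose divide -composite \
        -blur 0x1 -unsharp 0x20 \
        "$out"
}
```

For example: make_captcha 'all work and\nno play' captcha.png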