In a story published yesterday, the New York Times explains Captchas, those wavy words Web users have to type in before buying or subscribing to something. You may have heard that when you type one of those you are actually helping to transcribe unclear digitized text, but you probably didn’t know much about how it works. It turns out Google is using the service to verify the text of its scanned e-books. Here’s the NYT’s explanation (Dr. von Ahn is the guy behind the technology):
Page images, particularly those printed before 1900, are loaded with smudges, stains, watermarks and crooked type, all of which give O.C.R.’s the fits. To fix the errors, Dr. von Ahn uses a number of programs, which when applied in the proper sequence magically transform troubled passages into easy-to-read prose.
The first step is done in-house. Two different O.C.R. programs scan the photographic image. Both will make mistakes, but not necessarily the same mistakes.
ReCaptcha flags as “suspicious” any word that is deciphered differently by the two programs or that does not appear in an English dictionary. The dictionary catches words that are misspelled the same way by both O.C.R.’s. Other programs examine the words on either side of the suspect word and make another guess based on that analysis.
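The flagging rule described above can be sketched in a few lines of Python. This is only an illustration, not ReCaptcha's actual code: the function name `flag_suspicious`, the word-for-word alignment of the two O.C.R. outputs, and the use of a plain set as the dictionary are all assumptions, and the context-analysis step is left out.

```python
def flag_suspicious(ocr_a, ocr_b, dictionary):
    """Return the positions of words that should be flagged as suspicious.

    ocr_a, ocr_b: lists of words produced by two different OCR programs
                  (assumed here to be aligned word for word).
    dictionary:   a set of known lowercase English words (hypothetical input).
    """
    if len(ocr_a) != len(ocr_b):
        raise ValueError("this sketch assumes both OCR outputs have the same length")

    suspicious = []
    for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        disagree = a != b                          # the two programs read it differently
        unknown = a.lower() not in dictionary      # both may share the same misspelling
        if disagree or unknown:
            suspicious.append(i)
    return suspicious


ocr_a = ["the", "qvick", "brown", "fox"]
ocr_b = ["the", "qvick", "brovvn", "fox"]
dictionary = {"the", "quick", "brown", "fox"}
flagged = flag_suspicious(ocr_a, ocr_b, dictionary)
```

In this made-up example, "qvick" is flagged because both programs agree on a non-dictionary word, and the third word is flagged because the two programs disagree.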
Then each suspicious word is turned into a Captcha. It is crucial to understand that the Captcha is a distorted version of the word as printed in the original photographic image. It is not made from the O.C.R.’s imagined translation, which is often unintelligible. The unknown word is then paired with a second Captcha word whose correct translation is already known. This is the “control.”
Several Web users seeking entry to secure sites are then given both words and asked to decipher them separately.
A correct answer for the control word proves that the user is a human and not a machine. Answers for the unknown word are compared with the O.C.R. guesses and the context analysis. If the system is satisfied that the answer is correct, then the game is over.
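Here is a rough Python sketch of that last step, written under stated assumptions. The article does not say how the system decides it is "satisfied," so the voting rule below is invented for illustration: each user who gets the control word right contributes one vote, each distinct machine guess counts as one vote, and a word is accepted once it reaches a hypothetical `min_votes` threshold. The names `is_human` and `resolve_unknown` are likewise made up.

```python
from collections import Counter


def is_human(control_answer, control_truth):
    """A user passes if the control word is typed correctly (case-insensitive)."""
    return control_answer.strip().lower() == control_truth.strip().lower()


def resolve_unknown(responses, control_truth, machine_guesses, min_votes=2):
    """Return the accepted transcription of the unknown word, or None.

    responses:       list of (control_answer, unknown_answer) pairs from users.
    machine_guesses: OCR and context-analysis guesses for the unknown word.
    min_votes:       assumed acceptance threshold; not taken from the article.
    """
    votes = Counter()

    # Only answers from users who passed the control word are counted.
    for control_answer, unknown_answer in responses:
        if is_human(control_answer, control_truth):
            votes[unknown_answer.strip().lower()] += 1

    # Assumption: each distinct machine guess is worth one vote.
    for guess in {g.strip().lower() for g in machine_guesses}:
        votes[guess] += 1

    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


responses = [("morning", "upon"), ("mornin", "xxxx"), ("morning", "upon")]
accepted = resolve_unknown(responses, "morning", ["upon", "uporn"])
```

In this example the second user fails the control word, so that answer is ignored, and "upon" is accepted with two human votes plus one matching machine guess.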