http://www.aaas.org//news/releases/2008/0815documents.shtml
Science: Internet Security Program as an Archaeological Tool
A CAPTCHA is a distorted string of numbers or letters that must be read and typed, acting as a security measure on the World Wide Web. You might have solved a CAPTCHA before in order to gain entry into a secure website such as an email provider, ticket seller, social network, or blog. Now, researchers have modified the basic algorithm behind this online security program to help recognize words from faded texts that computerized optical character recognition programs are unable to decipher.
This new program, reCAPTCHA, was developed by Luis von Ahn and colleagues, and is currently in use on more than 40,000 websites. It captures the effort expended by human users all over the world, who collectively type more than 100 million CAPTCHAs each day. In this way, the program capitalizes on a task that humans can perform and computers still cannot.
The reCAPTCHA program is highlighted in the 15 August issue of Science, the journal of AAAS.
In an effort to preserve human knowledge and to make information more accessible to the world (as well as to make a profit), physical books and other texts are being digitized en masse. But the numbers and letters on a page are often faded or otherwise obscured, especially since many of these texts are old, worn, and out of print.
Specialized character-recognition computer programs scan the physical documents and create bitmap images of the text. From these images, the programs can often determine the intended message and re-create the actual text in digital form. However, this technology is far from perfect, and on average, the programs fail to recognize 20% of the text they convert to images. This is where reCAPTCHA comes into play.
When a particular word, scanned from text, is deciphered differently by two different character-recognition programs, that word is marked as "suspicious." reCAPTCHA then pairs this suspicious word with a known "control" word and presents both to computer users on the Web in the form of a CAPTCHA. If the human user deciphers the control word correctly, then the user's reading of the suspicious word is recorded as a plausible guess. When three human users decipher a suspicious word the same way, that word is verified and becomes a control word.
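The workflow described above can be sketched in a few lines of Python. This is only an illustration of the steps as the article states them, not the actual reCAPTCHA implementation: the names `is_suspicious`, `WordVerifier`, and `submit` are hypothetical, and ignoring letter case and surrounding spaces when comparing answers is an assumption made here for simplicity.

```python
from collections import Counter, defaultdict

# The article says three matching human answers verify a word.
AGREEMENT_THRESHOLD = 3


def is_suspicious(ocr_a: str, ocr_b: str) -> bool:
    """A word is suspicious when two OCR programs read it differently."""
    return ocr_a != ocr_b


class WordVerifier:
    """Hypothetical bookkeeping for control words and human guesses."""

    def __init__(self, control_words):
        # Maps a word id to its known correct text.
        self.control_words = dict(control_words)
        # Maps a suspicious word id to a tally of the answers typed for it.
        self.guesses = defaultdict(Counter)

    def submit(self, control_id, control_answer, suspicious_id, suspicious_answer):
        """Record one solved CAPTCHA; return the verified text, or None."""
        expected = self.control_words[control_id].lower()
        if control_answer.strip().lower() != expected:
            # The user missed the control word, so the guess is discarded.
            return None
        guess = suspicious_answer.strip().lower()
        self.guesses[suspicious_id][guess] += 1
        if self.guesses[suspicious_id][guess] >= AGREEMENT_THRESHOLD:
            # Enough users agree: the word is verified and becomes a control word.
            self.control_words[suspicious_id] = guess
            return guess
        return None
```

In this sketch, a guess counts only when the accompanying control word is typed correctly, and the third matching guess promotes the suspicious word into the control vocabulary.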
The vocabulary bank of reCAPTCHA's control words consists of more than 100,000 items, so a computer program that guesses a word at random would succeed, on average, no more than once in 100,000 attempts. Furthermore, any computer program that could crack reCAPTCHA's system would represent an improvement over state-of-the-art optical character recognition programs, and could be incorporated to improve the entire transcription process.
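The arithmetic behind that figure can be checked with a short calculation. The sketch below assumes that each guess is drawn uniformly from a vocabulary of exactly 100,000 words and that every attempt is independent of the others; neither assumption comes from the article, and the function name `random_guess_success` is hypothetical.

```python
VOCABULARY_SIZE = 100_000  # lower bound given in the article


def random_guess_success(attempts: int, vocabulary_size: int = VOCABULARY_SIZE) -> float:
    """Chance that at least one of `attempts` random guesses hits the control word.

    Assumes uniform, independent guesses (an illustrative simplification).
    """
    p_single = 1 / vocabulary_size
    return 1 - (1 - p_single) ** attempts


if __name__ == "__main__":
    print(random_guess_success(1))        # chance for a single guess
    print(random_guess_success(100_000))  # chance across 100,000 guesses
```

Under these assumptions a single guess succeeds with probability 1 in 100,000, and even 100,000 independent attempts give only about a 63% chance of at least one success.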
Currently, the number of suspicious words identified by human users with reCAPTCHA stands at about 4 million each day, with over 440 million suspicious words transcribed to date. reCAPTCHA exploits the superior performance of humans in reading distorted text, and represents an archaeological tool that is contributing to the collection and digitization of human knowledge. It also seems to be growing in popularity.
The creators of reCAPTCHA view the program's effectiveness as proof of concept for a more general idea as well: that otherwise "wasted" human effort can still be harnessed and utilized to solve problems that computers cannot.
15 August 2008
