We've been looking into various anti comment-spam measures recently. Kieran has done good work implementing support for the Movable Type blacklist system, which means that we can now block much more comment spam than we used to. However, it's still not perfect; it's in the nature of these things that new variants of spam inevitably arrive slightly ahead of new updates to the blacklist file, so (as we've already seen) spam still slips through from time to time.
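For the curious, blacklisting boils down to pattern matching. Here's a minimal sketch in Python of the general idea, not Kieran's actual code. It assumes a blacklist is a plain list of regular expressions, one per line, with `#` starting a comment; that format, the function names `load_blacklist` and `is_spam`, and the sample entry are all made up for illustration.

```python
import re


def load_blacklist(lines):
    """Compile one case-insensitive pattern per non-empty, non-comment line.

    Assumption: '#' starts a comment, so this sketch cannot handle
    patterns that themselves contain a '#'.
    """
    patterns = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            patterns.append(re.compile(entry, re.IGNORECASE))
    return patterns


def is_spam(comment, patterns):
    """Return True if any blacklist pattern occurs anywhere in the comment."""
    return any(pattern.search(comment) for pattern in patterns)


# Hypothetical blacklist with a single entry.
blacklist = load_blacklist([
    "# sample entries",
    r"cheap-pills\.example",
])
print(is_spam("Buy now at http://CHEAP-PILLS.example/", blacklist))
print(is_spam("Nice post, thanks!", blacklist))
```

The weakness is visible straight away: a comment is only caught if somebody has already added a matching pattern to the list.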
One other approach that's sometimes used to try to prevent spam is the captcha, a graphic which shows some distorted letters and asks the user to type in the letters that they see; if the typed letters don't match the text in the image, the comment is rejected. The theory is that since most spam is sent by computers running scripts, they won't be able to pass the captcha test, so they won't be able to post spam. At the same time as Kieran was looking at better blacklisting, I was looking into captchas. Here's what I learned:
- I presumed "captcha" was just a catchy name for something that attempts to capture whether you're a human being or a script, but not so; it stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Catchy. It's a trademark, too.
- One problem with captchas is that they are not in fact impossible for computers to solve. Various people have done work to demonstrate that captchas can be OCR'ed with very high success rates, though the success rate varies with the type of captcha. These people, for example, claim to be able to break the Yahoo captcha 92% of the time. Here's a list of common captchas with an assessment of their solvability.
- In general, making captchas harder for computers to solve also means making them harder for humans to read. Some people have done interesting work to make captchas which humans can read but computers would find difficult; here's an example of 3D lettering which (the authors believe) is human-readable but not computer-readable.
- Other people have come up with captcha variants which are not based around letters or numbers. Here's a captcha which draws shapes on a noisy background, here's one which asks you to click at the centre of distortion on a picture, and here's one which asks you to click on a shape which doesn't belong on its background.
- There's no strong evidence to suggest that spammers are actually using AI to solve captchas. It may in fact be cheaper just to use people to do the job: a human being can easily solve a hundred captchas an hour with a very high success rate.
- Cory Doctorow wrote about the whimsical idea that you could get humans to solve captchas for you by inviting people to your web site and offering them something they want (free porn or MP3s, say), then protecting that content with a captcha which in fact comes from some other site that you want access to. The visitor to your web site solves the captcha to get the porn, then you pass their answer on to the other site to get whatever resource you wanted. It's ingenious, but again, there isn't any evidence to suggest that it's actually going on.
- Apart from being solvable in principle, if not in practice, the other big problem with captchas is that they aren't accessible. It's part of their design that they use tricky colour combinations, busy backgrounds, strange fonts and distorted lettering, so they are at best hard to read; if you're even slightly visually impaired, they quickly become unworkable for you. And since they're images, not text, you can't resize them or change their colour scheme.
- For that reason, some people have experimented with alternative means to protect their comment forms, sign-up forms or whatever. One clever chap has used Flash to make something which looks just like a comment form, but is in fact a little Flash applet which, when you click on part of it, submits the form together with a key which is buried inside the SWF file. The spambot has no idea that clicking on a special part of the Flash applet is how you submit this particular form, so it wanders away disappointed.
- But my favourite, and in some ways most trivial, variant on the captcha is one which is completely accessible and works like this: add one extra field to your comment box, and have a label next to it with a question such as "What is my name?" or "What colour is a banana?" or "What bird rhymes with carrot?". It's accessible, because the text can be resized, read out loud, etc., but it's hard for a bot to defeat because it isn't algorithmically solvable. Lots of people seem to have had this idea, but Eric Meyer's GateKeeper implementation is one that's often cited.
This last idea, it seems to me, is particularly well-suited to us because we could let everyone define their own question and answer, and make them as hard or as easy as they like. One special case of the Q&A system would be simply "Enter the password" with no clue as to what the password is; that way only people that you've actually told the password to can comment. And of course if bots start building up databases of common questions and answers, you can just change yours to something new.
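To make that concrete, here's a rough sketch in Python of how a per-user question and answer might be checked. It isn't how GateKeeper or our own code works; the `Challenge` class, the `is_correct` function and the sample password are all invented for illustration, and it assumes that answers should match regardless of case and stray whitespace.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    """A question chosen by the blog owner, plus the answers they accept."""
    question: str
    accepted_answers: tuple


def normalise(text):
    """Fold case and collapse whitespace so ' Yellow ' matches 'yellow'."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def is_correct(challenge, submitted):
    """Return True if the submitted text matches any accepted answer."""
    accepted = {normalise(answer) for answer in challenge.accepted_answers}
    return normalise(submitted) in accepted


# Each blog owner defines their own challenge and can change it at any time.
banana = Challenge("What colour is a banana?", ("yellow",))
members_only = Challenge("Enter the password", ("swordfish",))

print(is_correct(banana, "  Yellow "))
print(is_correct(members_only, "letmein"))
```

Accepting several answers per question would let an owner allow for variant spellings, and changing the question is just a matter of storing a new `Challenge`.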
If we can't beat spam by blacklisting, I like the Q&A idea.
Update 10th August: I'm tickled by this idea for asking users to enter large random numbers to help calculate pi. Nice blog design, too.