How to fight spam in a wiki?

On Friday, while waiting for a librarian to fetch the old articles I wanted to read, I spent a few minutes removing spam from the AEL wiki. This form of spam is very easy to spot because it’s always the same: <small> HTML tags enclosing 30 links, with link text full of well-known spam and adult-oriented words (see the end of the MsSecurity page, where I didn’t have time to remove the spam).

After the librarian gave me my articles, I went back to my lab, thinking of a possible solution … This kind of spam is constant. Why not write a simple software bot that fetches each and every page of a wiki, checks whether it contains suspicious content, and then either cleans the page or moves on to the next one? This bot wouldn’t prevent spam, but it could act quickly after the fact (e.g. if you launch it every hour/day/week with cron).
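To make the idea a bit more concrete, here is a minimal sketch of the detection and cleaning step, assuming the spam always comes as a <small> block stuffed with links. The word list, the link threshold and the function names are my own illustrative assumptions; fetching and saving pages is left out because it depends on the wiki software.

```python
import re

# Illustrative assumptions: tune the word list and threshold for your wiki.
SPAM_WORDS = {"casino", "viagra", "poker"}
MIN_LINKS = 10

SMALL_BLOCK = re.compile(r"<small>(.*?)</small>", re.IGNORECASE | re.DOTALL)
LINK = re.compile(r"https?://\S+", re.IGNORECASE)


def is_spam_block(block_text):
    """Return True if the block holds many links and at least one spam word."""
    link_count = len(LINK.findall(block_text))
    words = set(re.findall(r"[a-z]+", block_text.lower()))
    return link_count >= MIN_LINKS and bool(words & SPAM_WORDS)


def clean_page(text):
    """Remove every <small>...</small> block that looks like link spam."""
    def replace(match):
        # Keep legitimate <small> blocks untouched.
        return "" if is_spam_block(match.group(1)) else match.group(0)
    return SMALL_BLOCK.sub(replace, text)
```

A cron job would then loop over all the pages, call clean_page on each one and save the result only when it differs from the original text.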

This afternoon, I thought I could not be the only one trying to find ways to fight spam on wikis. Indeed, fighting wiki spam already has a verb, “to chongq” (although it also includes retaliation), a Wikipedia page (even two) and many other dedicated pages.

Basically, there are three types of approach to fighting spam: wiki-specific methods, general http/web methods and manual actions.

  1. Wiki-specific methods are add-ons to your wiki system that help prevent spammers from modifying your wiki. For example, Wikimedia has its anti-spam features and a Spam blacklist extension, TWiki has a Black List plugin, etc. Once set up, you generally do not need to care about them (except to check that they are working properly, to update them, etc.).
  2. General http/web methods rely on general web mechanisms and/or special features independent of the wiki software you use. These systems are also automated: Bad Behaviour, CAPTCHA images, the “rel=nofollow” attribute in link tags, etc.
  3. Finally, manual actions can be taken by any human: removing spam like I did, renaming well-known wiki pages like the sandbox, etc. The only advantage of this method is that the human brain easily adapts to new forms of spam. Otherwise, it’s rather time-consuming …
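As an illustration of the second category, here is a rough sketch of how the “rel=nofollow” attribute could be added to the links of a rendered page. The function name is hypothetical, and the regular expression is a simplification that assumes attribute values never contain a “>” character; a real wiki would do this in its own rendering code.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
ANCHOR = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(html):
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""
    def fix(match):
        attrs = match.group(1)
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # leave an existing rel attribute alone
        return '<a%s rel="nofollow">' % attrs
    return ANCHOR.sub(fix, html)
```

With this attribute, search engines are asked not to give credit to the linked sites, which removes part of the incentive for spamming.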

Finally, I read that some spam bots remove spam, but only part of it. This is the kind of thing I would like to do, except that my bot should remove all the spam. (But before this one, I should start on the simple, geeky blog software.)

Addendum on August 21st, 2006: independently of this post, Ploum made an interesting summary of a post from Mark Pilgrim (this post looks rather old: 2002!). In his post, Mark Pilgrim sees two ways of fighting spam: Club or Lojack solutions.

With a Club solution, your wiki is protected against lazy spammers. Clubs are technical solutions that make it harder for spammers to vandalize your website/wiki/blog/etc. The Club works as long as not everyone has it. Once everyone has Clubs, spammers will think a little and update their software to circumvent most of them. In conclusion, “the Club doesn’t deter theft, it only deflects it.”

With a Lojack solution, your wiki isn’t necessarily protected, but spammers who vandalize it will be traced back. “Although it does nothing to stop individual crimes, by making it easier to catch criminals after the fact, Lojack may make auto theft less attractive overall.”

My bot that completely removes spam is definitely not a Lojack. But it’s not a Club either. This tool will let you be spammed and will not trace spammers back. Still, adding links to a wiki becomes less attractive for spammers, since the links will be removed soon after being added.

(Btw, I’ve just noticed that comments were automatically disabled for every post. That was not intentional.)