PHP web application development company (Birmingham, UK)

tel: 0845 0046746

Spam detection / tagging for a forum

One customer has a forum which has been consistently plagued by spam – here’s a short writeup about what we found worked for us in an initial attempt at reducing the tidal wave of spam.

The obvious initial caveat is that whatever measures may be detailed within this post are unlikely to be applicable for other scenarios – spammers change tactics.

The site in question runs a WordPress-based forum (bbPress) and receives a relatively high level of traffic (which presumably makes it a tempting target for spam postings).

The forum was protected by Akismet; however, even with Akismet in place, an unacceptable level of spam was getting through (Akismet reports having stopped 106,000 spam comments in the last few weeks).

Initially, we:

  1. Monitored the Apache access log – a quick rummage showed that the main pages involved were bb-post.php and bb-login.php (these received the bulk of the POST requests), often from the same IP address in quick succession.
  2. Edited bb-post.php and bb-login.php to dump the contents of the PHP GET and POST data to disk, so we could see what sort of requests they were receiving.
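The second step can be sketched with a small logging helper along these lines – an illustrative sketch, not the exact code we dropped into bb-post.php; the function name and log format are assumptions:

```php
<?php
// Illustrative sketch of the request-dumping step added to bb-post.php
// and bb-login.php; the function name and log format are assumptions.
function log_request_data($logfile)
{
    $entry = sprintf(
        "[%s] %s %s\nGET: %s\nPOST: %s\n\n",
        date('c'),
        isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-',
        isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '-',
        var_export($_GET, true),
        var_export($_POST, true)
    );

    // FILE_APPEND keeps earlier entries intact; LOCK_EX guards against
    // interleaved writes from concurrent requests.
    file_put_contents($logfile, $entry, FILE_APPEND | LOCK_EX);
}
```

Dumping `$_GET` and `$_POST` verbatim like this is only sensible as a short-lived diagnostic, but it makes the shape of the spammy requests very easy to eyeball.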

Based on the above, it was obvious that:

  1. A large volume of spammy POST requests originated in Asia, from a relatively small set of IP addresses.
  2. A proportion of the spammy comments being posted were in Chinese (yet the site targets English speakers).
  3. A proportion of the spammy comments were very large (i.e. >4 KB), while most legitimate posts were relatively small (2–4 lines of text).
  4. Spammy comments often contained a high number of URLs.

While it was initially tempting to go for the quick ‘kill’ and block subnets of users with an iptables rule, this wasn’t ideal as it could block legitimate users. The idea of having to maintain that pool of IP addresses wasn’t something to look forward to either – we’d need some way of identifying disreputable IP addresses/clients and automatically blocking them (perhaps with fail2ban in the future).

From past experience with tools like SpamAssassin, it seemed most sensible to adopt a scoring strategy for submitted posts – so, for example, we’ve added rules like the following:

  1. If the user agent matches a known pattern (e.g. Iceweasel on Debian), spam score += 1.
  2. If the user’s IP is located in Asia (through use of a GeoIP lookup), spam score += 1.
  3. If there are more than 3 URLs within the post content, spam score += 1.
  4. If the user’s IP is listed in real-time blacklists, spam score += 1.

Given enough rules and a reasonable threshold, the chance of wrongly tagging a legitimate post decreases. Now, if a post’s score is above a specific threshold (e.g. 5), we discard the request and return a non-descriptive error message.
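The scoring approach looks roughly like this in PHP – an illustrative sketch, not our production ruleset: the user-agent pattern, the example country codes standing in for “located in Asia”, the size cut-off, and the lowered threshold (this sketch only implements four rules) are all assumptions:

```php
<?php
// Illustrative scoring sketch: each rule adds to a score, and the post
// is treated as spam once the score passes a threshold. Patterns,
// country codes, and the threshold below are assumptions.
function score_post($content, $user_agent, $country_code)
{
    $score = 0;

    // Rule: user agent matches a pattern we see abuse from (example only).
    if (stripos($user_agent, 'Iceweasel') !== false) {
        $score += 1;
    }

    // Rule: client IP geolocates to a region we see abuse from
    // (example country codes standing in for a continent check).
    if (in_array($country_code, array('CN', 'KR'), true)) {
        $score += 1;
    }

    // Rule: more than 3 URLs within the post content.
    if (preg_match_all('#https?://#i', $content, $m) > 3) {
        $score += 1;
    }

    // Rule: unusually large post body (legitimate posts were small).
    if (strlen($content) > 4096) {
        $score += 1;
    }

    return $score;
}

// Threshold lowered to 3 here purely because this sketch has only
// four rules; the post above uses 5 against a larger ruleset.
function is_spam($score, $threshold = 3)
{
    return $score >= $threshold;
}
```

Each rule on its own would produce false positives; it’s the combination that makes the tagging safe.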

We’ve implemented the request processing before the forum code itself runs/loads, and to minimise overhead, we only check HTTP POST requests.
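The early gate can be sketched as below, assuming a scoring callable that implements rules like those above; the function name, the `'pass'`/`'block'` return values, and the way it’s wired in (e.g. an include at the top of bb-post.php, or PHP’s `auto_prepend_file` setting) are all illustrative:

```php
<?php
// Sketch of the pre-forum gate: only POST requests get scored,
// everything else falls straight through to the forum code.
// $score_request is a callable implementing the scoring rules.
function handle_incoming_request(array $server, $score_request, $threshold = 5)
{
    // Minimise overhead: anything other than a POST passes untouched.
    if (!isset($server['REQUEST_METHOD']) || $server['REQUEST_METHOD'] !== 'POST') {
        return 'pass';
    }

    if (call_user_func($score_request) >= $threshold) {
        // On 'block', the caller should emit a non-descriptive error and
        // exit, giving the spammer no feedback about which rule fired.
        return 'block';
    }

    return 'pass';
}
```

Returning a decision rather than calling `exit` directly keeps the gate testable; the thin wrapper that actually sends the error response sits around it.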

Appropriate logging of requests and data has allowed us to tweak the ruleset over the last 24 hours. In one 12-hour period, we identified 3,501 spam comments and had one spam posting get through by mistake (which led to a rule update, and was caught by Akismet anyway); so far, there have been no false positives.

We’ve no intention of replacing Akismet – it does an excellent job – but certain requests proved easy to target, and by blocking them we can make a significant difference for the site’s performance and its users.
