Collecting a SPAM corpus

One of my employees pointed me at 10 Minute Mail again today. A quick discussion of the concept sparked a few ideas …

Firstly, the arguments for and against the 10 Minute Mail concept are pretty much a seesaw: whatever one party says, that exact same argument can most likely be turned around and used the other way. So I’ll try to bullet-list a few ideas around it:

  • A spammer can easily use 10 Minute Mail to obtain addresses that will pass sender-callout verification. (Greylisting can prevent this on the recipient side, but not everybody implements it.)
  • It’s very handy for preventing spam if you can sign up somewhere with a non-persistent account.
  • The sheer quantity of mail mentioned on 10 Minute Mail goes to show how effective email harvesting is. Personally, my email address is well protected and I can count the spam that I receive in a week on one hand, even with SpamAssassin switched off.

Now, what we really want to do is collect as much spam as we possibly can. Whether it’s to feed DCC filters or SpamAssassin Bayesian training, I really don’t care what you’re after: the more spam we can obtain in real time, the better. Yes, other filtering methods are already excellent and carry less risk than overly heuristic mechanisms, but not everybody is in a situation where they can control the configuration on the MX servers in order to take advantage of some of these checks. In fact, my most effective checks only really care about a few pieces of information, namely the incoming IP address, the HELO string, and the return path. However, in some cases heuristic mechanisms are required, and in these cases we need a very good corpus to train on. “Good” here refers to both quantity and a guarantee of correctness.

There are a couple of tricks that I’ve used in the past, with varying degrees of success.

  1. Accept all “relay attempt” emails on servers that aren’t normally used as relay servers (yes, this gets me a few messages per day).
  2. Hide email addresses in HTML comments on web pages, and send all mail arriving at those addresses to the corpus (highly effective; on some of these traps I get up to 50 messages a day at the moment).
  3. Use the address on a few “registration” links for various dubious websites. The problem here is that you MIGHT actually get legit mail from said dubious site from time to time.
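To make trick two a bit more concrete, here is a minimal sketch of the "everything sent to a trap address goes to the corpus" step. It assumes the corpus is kept as an mbox file and that something upstream (an MTA hook or delivery script) hands over the envelope recipient and the parsed message. The trap addresses and the function name `file_if_trap` are made up for illustration, not taken from any real setup.

```python
import mailbox
from email.message import EmailMessage

# Hypothetical trap addresses: the ones hidden in HTML comments.
TRAP_ADDRESSES = {"trap1@example.org", "trap2@example.org"}

def file_if_trap(recipient, message, corpus_path):
    """Append message to the mbox corpus if it was sent to a trap address.

    Returns True when the message was filed as spam, False otherwise.
    """
    if recipient.strip().lower() not in TRAP_ADDRESSES:
        return False
    box = mailbox.mbox(corpus_path)  # creates the file if it is missing
    try:
        box.add(message)
    finally:
        box.close()  # flushes pending writes to disk
    return True
```

Since nobody legitimate ever sees these addresses, anything filed this way can go straight into training without manual review, which is what gives this trick its guarantee of correctness.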

The idea that cropped up with 10 Minute Mail is actually brilliant. We’re not sure how effective it would be, nor do we really care, as it is in fact based on method three, which can generate false hits. Basically, the service could allow email for ten minutes, taking note of the addresses from which it receives mail. After the ten-minute barrier, any email coming from an address that isn’t at least in the same domain as the messages received in the first ten minutes is more than likely spam.
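The rule described above could be sketched roughly as follows. This is only an illustration of the idea, not how 10 Minute Mail actually works: the class name, the labels it returns, and the choice to compare exact domains (so subdomains are not treated as a match) are all assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # lifetime of the temporary address

def domain_of(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()

class TemporaryMailbox:
    """Hypothetical model of one ten-minute address."""

    def __init__(self, created_at):
        self.created_at = created_at
        self.known_domains = set()  # sender domains seen inside the window

    def classify(self, sender, received_at):
        dom = domain_of(sender)
        if received_at - self.created_at <= WINDOW:
            # Inside the window: deliver and remember the sender's domain.
            self.known_domains.add(dom)
            return "deliver"
        if dom in self.known_domains:
            # Late mail from a known domain: likely the site we signed up
            # with, so keep it out of the corpus.
            return "late-legit"
        return "probable-spam"
```

For example, a mailbox created at 12:00 that gets a confirmation from `noreply@example.com` at 12:03 would deliver it; an hour later, mail from `news@example.com` is treated as known, while mail from any other domain is flagged for the corpus. Because this builds on method three, the "probable-spam" label should be treated as a strong hint rather than a guarantee.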

One of these days, when I have time … I must build a proper honey pot for spam.
