DNS Lists, Greylisting and other SPAM related stuff

We all hate receiving SPAM. It’s simple: someone sends you a load of junk mail (be it the common VIAGRA spam or more targeted SPAM from companies using mailing lists), you receive it, and you spend about 10 seconds deciding it’s junk before deleting it. So today ISPA had a “conference” of sorts, called the SpamJam II. Quite a few questions were raised, and some answers provided. And actually some stuff which I’ve been thinking about over the last couple of days was under heavy discussion.

Quantifying the effects of SPAM

One of the issues that was raised was how you quantify the effects of SPAM on the economy of a country (and then South Africa specifically). It’s interesting to know that one of the Tier 1 ISPs in South Africa invested 20 million just on anti-spam infrastructure. This does NOT include the additional routers, switches or pipes that are required to handle the additional bandwidth. Obviously this cost gets pushed down to the consumer (via the Tier 2 ISPs usually, but it gets there). Thus general internet prices get pushed up, ISPs have to put in more CPU, more RAM and more hard drive space and pay for more bandwidth, companies spend lots of time deleting unwanted email, and we get false positives. All in all – it’s a mess.

Part of the problem is that the ECT Act allows for the sending of unsolicited mail as long as an opt-out is provided. The problem with this is quite simply that even though you can opt out of a list, you can always land back on that list again.

So how do you quantify the cost of SPAM? It was pointed out that doing this is actually very near impossible, and I have to agree. Estimating the bandwidth costs that would have been incurred if it was not for measures such as DNS BLs and greylisting is nearly impossible. Whilst we can say with reasonable certainty how many emails we’re blocking, we cannot say how big those emails would have been. Also keep in mind that with legit mail coming in at, say, around 5 Mbps for a smallish ISP, SPAM on top of that can easily push the peak to over 6 or 7 Mbps (the values differ by so little because SPAM messages generally are not particularly big). Now you’re also running into issues where you at least temporarily HAVE to store that message (be it in RAM or on disk) in order to scan it, wasting both storage and CPU. Keeping things in RAM is all good and well, but it’s simply insane to think that as the system administrator for a smallish ISP I’m currently blocking messages at a rate of slightly over 1 per second! More than 70 % of that is without even looking at the body of the message, but instead simply considering the triplet consisting of the return path, the recipient and the connecting mail server.

DNS Lists

These are a commonly accepted mechanism for blocking known spammers. The problem with these lists is a relatively high rate of false positives. Consider for a moment the case of SAIX, where smtp.saix.net lands on DNS BLs more often than I can keep track of!

So we really need a way to identify “good” servers and, when they do land on the blacklists (such as spamcop), ignore the listing. This can be done in local configuration. However, what has been proposed (and what exim provides for) is to create a whitelist. The person that came up with this proposal currently runs such a list and accepts all mail from these servers, and it seems to work for him.
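
To make the local-configuration variant concrete, here is a minimal Python sketch rather than an actual exim setup. It assumes IPv4 and the usual DNS BL convention of looking up the reversed octets of the connecting IP under the list’s zone. `LOCAL_WHITELIST`, the function names and the `lookup` parameter are all made up for illustration; `lookup` is only there so the logic can be exercised without real DNS.

```python
import socket

# Hypothetical local whitelist of "good" servers (documentation address).
LOCAL_WHITELIST = {"192.0.2.25"}


def dnsbl_query_name(ip, zone):
    """Build the DNS BL query name: reversed IPv4 octets followed by the zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone, lookup=socket.gethostbyname):
    """A host counts as listed when the query name resolves to an address."""
    try:
        lookup(dnsbl_query_name(ip, zone))
        return True
    except OSError:  # socket.gaierror is a subclass: no record, so not listed
        return False


def should_block(ip, zones, lookup=socket.gethostbyname):
    """Block hosts that appear on any blacklist, unless whitelisted locally."""
    if ip in LOCAL_WHITELIST:
        return False
    return any(is_listed(ip, zone, lookup) for zone in zones)
```

Checking the whitelist first also means a whitelisted host never costs a blacklist lookup at all.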

Greylisting

Greylisting is another mechanism that is commonly used to fight dynamic IPs and drop-and-run spam bots. It serves another purpose however: it gives DNS BLs time to kick in. I’m not so sure that would hold if everybody was running greylisting, and I’m not running greylisting atm, mostly for other reasons. These can probably be summarized (and yes, the details are rather technical) as follows:

  1. Clients expect email to be near instantaneous – even when the protocol inherently is not.
  2. It makes it difficult for others to perform sender-callout verifications against my domains.
  3. DB resource consumption – based on my logs I’d be running a pretty big database.

Well, that, and I simply haven’t found enough motivation to find a sane way of doing it yet. Yes, there exist daemons and stuff out there, but I simply haven’t had the time to weed the stuff out (and being me, I like to build the solutions myself and have full control over them).

Plus, I now have some additional ideas that might not be feasible with the existing tools.

Different levels of DNS Listing

Tackling the problem of incorrectly listed ISPs is actually pretty easy if you simply consider a “whitelist”, so in your MTA you can say “block mail from hosts that appear in any of my blacklists, but in none of my whitelists”. However, I think we can do better than that, and this has been discussed to an extent at the SpamJam today. There have been lots of arguments around this, but I’ve effectively identified 5 levels at which I’d like to see lists:

  1. Full-relay – fully trusted, and you’re willing to bet your own reputation on that.
  2. Full trust – accept mail from these hosts without using any “expensive” anti-spam mechanisms.
  3. Known MTAs – we’ll accept mail from these hosts even if they appear in blacklists.
  4. default – This is actually where servers are by default, if they appear nowhere else.
  5. blacklist – known spammers. No mail from these hosts will be accepted.

I’m not too worried about 5 (the likes of SORBS and spamcop do an excellent job at finding and listing these). Number 1 is also usually site-specific and should be restricted locally anyway (ie, there is no point in publishing this in a public list). Full trust is also highly debatable, but for the moment, let’s assume it’s all good. The problem here is that ULS may have a mail server which it uses for its corporate mail and it may well be trustworthy – but not everybody may feel this way regarding us. More on that later.

Essentially what I’m getting at is that when a mail server connects to us, those are the categories we want to get it into, such that we can decide how stringent we are and which checks we’re going to employ. Other people may have (want) additional levels, but those are the ones that I think are reasonable (at least to get things going).

What constitutes a full relay? Well, say I’m running a gateway for a network then that LAN will probably fall into this category. Some checks that you may want to perform:

  • sender callout verification (at least ensure that the domain exists, but preferably a full sender verify callout).

At full trust we’re basically saying we’ll accept all incoming mail from that server without the need to perform any checks. Thus a transmitting entity needs to be highly trusted to get onto this list. In fact, this (in a way) is just as trusted as relay. Essentially we’re saying you’re good for sending mail to us without having to go through expensive (will be clarified in a second) tests. Again, whether or not you perform sender callout verification is up to you, but at this level you should at least start performing recipient callout verification.

Expensive tests are actually pretty easy to define: spamassassin. Currently spamassassin accounts for around 80 % of the CPU consumption on my mail servers. If I don’t have to let mail go through spamassassin that’ll improve capacity a LOT.

Known MTAs are things like public relays (eg, smtp.saix.net, smtp.isdsl.net, smtp.vodacom.co.za and the like). These _will_ pass greylisting checks, so there is no point in putting them through this. It was pointed out that greylisting does have a secondary purpose though: it gives the DNS blacklists time to pick up on an IP that should be listed. This is, however, not a concern for me as I don’t want these known MTA servers to end up on blacklists in any case. At this level, though, I do want to perform expensive tests such as putting the blokes through spamassassin.

If a host is blacklisted (and not “whitelisted” via one of the above lists) then I am going to block it at this point in time. It’s not known good, which means it’s not trusted, and ending up on a blacklist probably indicates that it’s bad – therefore I reason that it’s a reasonably pro-active measure to protect my network.

Now, for the rest of the millions of hosts not listed, it’s actually pretty simple: first I want you to pass greylisting, for a couple of reasons:

  • By delaying your mail, I want to encourage you to get listed as an MTA.
  • If you’re a drop-and-run mailbot I want you to bugger off.
  • I want to give the DNS BLs time to pick up on you (hopefully you’re going to hit a spamcop controlled booby trap before you come back for re-delivery) if you’re a slightly more intelligent bot.

I reason that there are still a lot of hit-and-run mail bots out there which will get caught by this, and those that don’t, oh well, hopefully spamassassin will catch them.

Once greylisting has been passed I’m willing to at least look at your mail.

Note that this should start reducing the amount of mail going off to spamassassin by a measurable margin, whilst at the same time not impacting heavily on the accuracy of decisions (and quite probably allowing for improving the accuracy).
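
The levels and the checks attached to them can be sketched as a small Python lookup. This is only an illustration: `Level`, `classify` and `CHECKS` are hypothetical names, the check names are plain labels rather than real MTA options, and running recipient callouts at the levels below full trust is an assumption the text only hints at.

```python
from enum import Enum


class Level(Enum):
    FULL_RELAY = 1
    FULL_TRUST = 2
    KNOWN_MTA = 3
    DEFAULT = 4
    BLACKLIST = 5


def classify(ip, relay, full_trust, known_mtas, blacklisted):
    """Pick the most trusted list the host appears on; whitelists beat blacklists."""
    if ip in relay:
        return Level.FULL_RELAY
    if ip in full_trust:
        return Level.FULL_TRUST
    if ip in known_mtas:
        return Level.KNOWN_MTA
    if ip in blacklisted:
        return Level.BLACKLIST
    return Level.DEFAULT


# Checks to run per level, in order. None means: reject without further checks.
CHECKS = {
    Level.FULL_RELAY: ("sender_callout",),
    Level.FULL_TRUST: ("recipient_callout",),
    Level.KNOWN_MTA: ("recipient_callout", "spamassassin"),
    Level.DEFAULT: ("recipient_callout", "greylisting", "spamassassin"),
    Level.BLACKLIST: None,
}
```

Only the last two whitelisted levels and the default ever reach spamassassin, and only the default pays the greylisting delay, which is where the saving in CPU is expected to come from.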

Using greylisting for skipping expensive tests

This idea is actually pretty simple. Greylisting itself allows us to determine whether a triplet (return-path, recipient, connecting-server) has been seen before. If not, we delay that triplet for a certain time and then allow it through, provided that we keep seeing mail for it. Now, based on experience, when two people communicate (ie, non-spam) they will continually use the same servers to send mail, whereas spam will probably not (or at least, hopefully not frequently enough for what I’m thinking, and I haven’t figured out what the thresholds should be yet).
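
For reference, the basic greylisting decision on a triplet can be sketched in a few lines of Python. This is an in-memory toy, not one of the existing daemons: the `Greylist` class, the 300-second delay and the `now` parameter (there to make the behaviour testable) are all assumptions, and a real setup would also need persistence and expiry of old triplets.

```python
import time


class Greylist:
    """Minimal in-memory greylist keyed on (return_path, recipient, server_ip)."""

    def __init__(self, delay=300):
        self.delay = delay    # seconds a new triplet has to wait (assumed value)
        self.first_seen = {}  # triplet -> timestamp of the first attempt

    def check(self, return_path, recipient, server_ip, now=None):
        """Return True to accept, False to answer with a temporary failure."""
        now = time.time() if now is None else now
        triplet = (return_path, recipient, server_ip)
        # Remember the first attempt; later attempts are measured against it.
        first = self.first_seen.setdefault(triplet, now)
        return now - first >= self.delay
```

A drop-and-run bot never retries, so its triplet stays deferred, while a real MTA retries after the delay and gets through.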

If we can store the times for the last couple of messages hitting a certain triplet, then in theory we can determine whether there is continual communication between two parties (or at least, in the one direction). If we set some threshold, say that more than 5 messages in the last week constitutes solicited mail, then when we hit such a triplet we can exempt that particular mail from more expensive checks. This does not augment our lists above directly, but it gives some additional trust to servers that send us mail from the same source continually. Whether this will function as expected remains to be seen.
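
A possible shape for this, as a Python sketch: keep the arrival times per triplet and count how many fall inside the window. The class and method names are hypothetical, and the defaults simply reuse the “more than 5 messages in the last week” example, which is a placeholder rather than a tested threshold. Timestamps are assumed to be recorded in increasing order.

```python
from collections import defaultdict, deque

WEEK = 7 * 24 * 3600  # seconds


class TripletHistory:
    """Remember recent delivery times per (return_path, recipient, server_ip)."""

    def __init__(self, threshold=5, window=WEEK):
        self.threshold = threshold
        self.window = window
        self.times = defaultdict(deque)

    def record(self, triplet, now):
        """Store the arrival time and drop entries that fell out of the window."""
        seen = self.times[triplet]
        seen.append(now)
        while seen and seen[0] <= now - self.window:
            seen.popleft()

    def skip_expensive_checks(self, triplet, now):
        """True when more than `threshold` messages arrived within the window."""
        recent = [t for t in self.times.get(triplet, ()) if t > now - self.window]
        return len(recent) > self.threshold
```

Pruning in `record` keeps the per-triplet storage bounded by the traffic inside one window, which matters given the database size concern raised earlier.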

Using greylisting to detect known MTAs

The data from the greylisting can also be used to construct a list of known MTAs. Not MTAs that I’d like to let bypass blacklist checks, but ones that only bypass the greylisting delay. I’d like to record the triplets for all cases from known MTA downwards in any case, for the purposes of the above. This does actually bring in a sixth level between known MTA and default called something like “no greylisting”, although “known MTA” suits this better, so perhaps the “known MTA” above should be called “known good MTA”.

What would be the criteria for automatically listing a server in this range? Well, actually dead simple:

  • Received at least x number of messages from the server in a certain period.
  • Never failed greylisting during that period.

The question becomes – since we’re listing automatically, should hosts also come off that list automatically? And if so – how? The problem is that once a host is listed it’s impossible for it to fail greylisting in order to get delisted, so I reckon some shortish timeout of no mail from the particular host constitutes a delisting, as it’s no longer a known MTA. Going from “known MTA” to “known good MTA” should imho be a manual procedure, preferably requiring registration. The known good MTA list can be centrally maintained; I’m not so sure whether this holds for known MTA.
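
The automatic listing and the timeout-based delisting could be sketched like this in Python. Everything here is an assumption for illustration: the class name, the defaults for “x messages”, the period and the idle timeout are placeholders, and “failed greylisting” is modelled as the caller reporting a deferred delivery that was never retried.

```python
from collections import defaultdict

DAY = 24 * 3600  # seconds


class KnownMtaList:
    """Auto-list servers that keep passing greylisting; delist them when idle."""

    def __init__(self, min_messages=20, period=7 * DAY, idle_timeout=3 * DAY):
        self.min_messages = min_messages
        self.period = period
        self.idle_timeout = idle_timeout
        self.passes = defaultdict(list)    # ip -> times mail passed greylisting
        self.failures = defaultdict(list)  # ip -> times a delivery was never retried
        self.last_mail = {}                # ip -> time of the most recent mail
        self.listed = set()

    def record(self, ip, now, passed_greylisting=True):
        """Note one delivery attempt and list the server once it qualifies."""
        bucket = self.passes if passed_greylisting else self.failures
        bucket[ip].append(now)
        self.last_mail[ip] = now
        cutoff = now - self.period
        recent_passes = [t for t in self.passes[ip] if t > cutoff]
        recent_failures = [t for t in self.failures[ip] if t > cutoff]
        if len(recent_passes) >= self.min_messages and not recent_failures:
            self.listed.add(ip)

    def is_known_mta(self, ip, now):
        """Listed servers that have been silent for too long drop off the list."""
        if ip in self.listed and now - self.last_mail[ip] > self.idle_timeout:
            self.listed.discard(ip)
        return ip in self.listed
```

Note that in this sketch a delisted server is relisted as soon as it qualifies again, so the idle timeout should be chosen together with the listing period.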

Categorizing entities owning the servers or the servers themselves

As mentioned earlier we may well need finer control over which types of corporates we want to trust. So typically I’d suggest categorizing the different entities owning the servers into lists such as:

  • ISPA registered ISP
  • registered ISP
  • Banking institution
  • Medical facility
  • etc …

Since the only fully automatic listing is the “known MTA” this is fine, and the above will be totally separate lists from the trust level. I’m not yet convinced as to the usefulness of this, except maybe that one can now register the ISPs in one list and marketing companies in another, so whilst they may own “known MTAs”, or even “known good MTAs”, I can still inform my mail server to not fully trust those MTAs. Or if banks get listed as full trust (in other words, they’ve got things like abuse in place, and the server is known to be a corporate-only outbound) I can still choose to not fully trust them.

I can’t imagine that much more than this would be overly useful. The only thing there needs to be stringent control on is the listing of “known good MTAs” and “full trust” servers. Full trust should only be afforded to companies that respond to spam notifications, have working abuse and postmaster addresses and have supplied alternative contact info (or for which this can be obtained). Ideally these organizations should get themselves listed, but since this won’t happen it will (at least initially) be a rather manual process.

Well, I guess it’s time to start coding …
