Getting a grip on greylisting

I’ve been wondering for quite a long while now whether I should implement greylisting and, if I do, what the dangers are, and I’ve been rather nervous about the whole situation.  Quite recently I decided that the amount of junk coming into my mailbox again warranted some further action.  In particular, I was getting back to the point where I was receiving 2 to 5 spam messages per day on a mailbox that has no SpamAssassin to protect it, only yes/no checks (the type that I prefer).

Right, so after pondering it a bit I decided not to use one of the existing greylisting daemons (they seem to mostly plug into postfix and I’m using exim, plus it turns out I like doing things the hard way, meaning I get to see what’s happening under the hood and get some control), but rather to implement it using mysql.  I once heard somebody say this is a bad idea, but I decided to take the risk anyway; after all, I could just roll back if this caused the server to keel over and die.

Around an hour after making this decision I had my first prototype, about 40 lines of SQL code (and here I just have to test the wordpress code tags; the version here is heavily wrapped for the sake of horizontal space):

CREATE FUNCTION `mail_greylist`(
    rcpt_address varchar(128),
    sender_address varchar(128),
    hostip int unsigned,
    nodelay int(1)
) RETURNS varchar(8)  -- return type is an assumption
BEGIN
    DECLARE init, success datetime DEFAULT NULL;

    -- Exempt host: record the success and pass immediately.
    IF nodelay THEN
        INSERT INTO mail_greylist(ip, recipient,
            sender, initial_attempt, last_success)
        VALUES(hostip, rcpt_address, sender_address,
            NOW(), NOW())
        ON DUPLICATE KEY UPDATE last_success=NOW();
        RETURN 'PASS';
    END IF;
    -- Both variables stay NULL when no row matches.
    SELECT initial_attempt, last_success
        INTO init, success FROM mail_greylist
        WHERE recipient=rcpt_address AND ip=hostip
            AND sender=sender_address;
    -- We've not seen this before.
    IF init IS NULL THEN
        -- The duplicate key is a safety net (it will probably never get used).
        INSERT INTO mail_greylist(ip, recipient, sender, initial_attempt)
            VALUES(hostip, rcpt_address, sender_address, NOW())
            ON DUPLICATE KEY UPDATE initial_attempt=initial_attempt;
        RETURN 'INIT';
    END IF;
    -- Retry within 14 min of initial attempt.
    -- (Condition reconstructed from the comment above.)
    IF init > DATE_SUB(NOW(), INTERVAL 14 MINUTE) THEN
        RETURN 'TIME';
    END IF;
    -- Last success was more than 30 days ago, ignore the success,
    -- set initial time to now and fail.
    -- (Condition reconstructed from the comment above.)
    IF success < DATE_SUB(NOW(), INTERVAL 30 DAY) THEN
        UPDATE mail_greylist SET initial_attempt=NOW(), last_success=NULL
            WHERE ip=hostip AND recipient=rcpt_address AND sender=sender_address;
        RETURN 'STALE';  -- placeholder code, the real one is not known
    END IF;
    -- We haven't had any success previously,
    -- and the initial attempt is more than 24 hours old.
    IF success IS NULL AND init < DATE_SUB(NOW(), INTERVAL 24 HOUR) THEN
        UPDATE mail_greylist SET initial_attempt=NOW()
            WHERE ip=hostip AND recipient=rcpt_address AND sender=sender_address;
        RETURN 'EXPIRE';
    END IF;
    -- All checks passed.  Let it through.
    UPDATE mail_greylist SET last_success=NOW()
        WHERE ip=hostip AND recipient=rcpt_address AND sender=sender_address;
    RETURN 'PASS';
END

So the plug-in works acceptably, with horizontal space a problem as usual.  The table creation looks as follows:

CREATE TABLE `mail_greylist` (
  `ip` int(10) unsigned NOT NULL default '0',
  `recipient` varchar(128) NOT NULL default '',
  `sender` varchar(128) NOT NULL default '',
  `initial_attempt` datetime NOT NULL,
  `last_success` datetime default NULL,
  PRIMARY KEY  (`ip`,`recipient`,`sender`),
  KEY `email_id` (`recipient`)
);

So all in all a pretty simple thing, although it can probably do with some improvement.  The usage in exim (somewhere in the rcpt to: acl) looks like:

warn    set ACL_RCPT_NO_GREY = false
       !hosts =
        dnslists =
        set ACL_RCPT_NO_GREY = true
warn    set ACL_RCPT_GREYLIST = ${lookup mysql{ \
            SELECT mail_greylist( \
                '${quote_mysql:$local_part}@${quote_mysql:$domain}', \
                '${quote_mysql:$sender_address}', \
                INET_ATON('${quote_mysql:$sender_host_address}'), \
                $ACL_RCPT_NO_GREY)}}
       !hosts =
defer  condition = ${if !eq{$ACL_RCPT_GREYLIST}{PASS}}
      !hosts     =
       message   = Greylisting ($ACL_RCPT_GREYLIST).

So it’s not something that’s extremely difficult to invoke.  The RBL used will be explained in a bit, as it was not initially available.  Essentially the second warn and the defer are where it all started out; the nodelay option also wasn’t in the first draft (it only got added a few hours later, when the impact of greylisting major ISPs, as well as certain braindead banks, became all too apparent).

The table is quite simple really: we store when we saw the initial delivery attempt for a given triplet of “incoming ip”, “mail from” and “rcpt to”.  We don’t accept email within 14 minutes of that first attempt, nor more than 24 hours after it if there hasn’t been a previous success.  A previous success has to be within the last 30 days, or we kick off the process again.
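
To make the timing rules easier to follow, here is a small Python model of the decision for a single triplet.  It is only a sketch of the rules as described, not the SQL that runs on the server: the function name and constants are made up for illustration, and the “STALE” code for the 30 day case is a placeholder of my own.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_DELAY = timedelta(minutes=14)    # retries sooner than this are deferred
RETRY_WINDOW = timedelta(hours=24)   # an unconfirmed triplet expires after this
SUCCESS_TTL = timedelta(days=30)     # how long a previous success stays valid


def greylist_decision(now: datetime,
                      initial_attempt: Optional[datetime],
                      last_success: Optional[datetime]) -> str:
    """Return the verdict for one (ip, sender, recipient) triplet.

    Only "PASS" means the message is accepted; every other code is a
    temporary deferral.
    """
    if initial_attempt is None:
        return "INIT"        # never seen before: record it and defer
    if now - initial_attempt < MIN_DELAY:
        return "TIME"        # retried too quickly
    if last_success is not None and now - last_success > SUCCESS_TTL:
        return "STALE"       # placeholder code: old success, start over
    if last_success is None and now - initial_attempt > RETRY_WINDOW:
        return "EXPIRE"      # never succeeded and waited too long
    return "PASS"


if __name__ == "__main__":
    first_seen = datetime(2024, 1, 1, 12, 0)
    for minutes in (5, 20, 25 * 60):
        now = first_seen + timedelta(minutes=minutes)
        print(minutes, greylist_decision(now, first_seen, None))
```

A sender that retries between 14 minutes and 24 hours after its first attempt gets through; anything faster or slower is deferred again.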

This works like a charm.  In fact, the load of this is near negligible, and if anything, it brought our overall load down.

So what are the problems?  Well, a few things really:

  1. A temporary deferral of one message can cause the sending MTA to mark our entire mail server as deferred, and every temporary deferral delays _all_ mail queued for that destination by minutes, hours or days at a time, so if mail arrives more often than the shortest retry delay, all mail will eventually fail due to timeouts.  This is a huge risk for us.  Basically this comes down to something like “Via IS we receive more than one message every 15 minutes, therefore whenever email comes in from IS it’s going to be a previously unseen triplet, thus we’re going to keep on greylisting them”.  The way around this is either to greylist based only on the incoming IP (which is too open, since a spammer could work through a list against our mail server, take longer than the greylist period and get spam in), or alternatively to realize that it’s a proper MTA, it’s GOING to retry, and just exempt it from greylisting (IS’s anti-spam measures on their public relay are in any case of such quality that I don’t even want to bother putting them through any anti-spam checks; I wish I could say the same for the majority of other ISPs in this country).
  2. Some banks (Yes ABSA, I’m looking at YOU) don’t seem to retry at all.  Or if they do, the retry period is so damn long that for all practical purposes they could be treating temporary errors as permanent.  In fact, ABSA is the reason why we’ve extended the 12 hour max period to 24 hours.
  3. Users have come to expect email to be damn-near instantaneous.  If we force every new triplet through a minimum 25 minute wait we’re going to lose clients by the hundreds.  This is not an option.  Exposing them to SPAM is not an option either.  As Kevin aptly put it:  Don’t filter spam, and they complain; filter it, and they complain.  Get a single incorrect classification and they moan.  It comes down to the fact that people complain.  They like to complain.  Either way, delaying email by even 5 to 10 minutes on average is simply not an option.

And so we had to come up with a solution, and fast.  I almost knew this was going to be needed even before I implemented greylisting, but I figured I would have a couple of weeks.  It turned out I had hours (even though it took a couple of days).  I had to find a way to detect retrying mail servers from reasonably major ISPs and exempt them from greylisting.  And thus the DNS list used in the dnslists check above was born.  It’s built from the greylisting table above: it essentially groups all the data by IP, ignoring bounces (sender="") and all failures less than 48 hours old or older than 7 days.  Any host that has delivered successfully with at least 5 differing quadlets (I’m currently gathering info from four hosts, thus the receiving IP becomes the fourth item in the tuple) and has had no failures in the 48 hour to 7 day period is listed.  Currently I have just over 1900 hosts in the list, including a number of mail servers from the likes of hotmail, google, IS, iBurst, ABSA, SAIX, Vodacom and a few others.
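
The selection rule can be sketched in Python roughly as follows.  This is an illustration of the rule as described, not the query that actually builds the list: the `Row` layout, the `receiver` field and the function name are assumptions, and I am treating a row without a `last_success` as a “failure” whose age is measured from `initial_attempt`.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Optional, Set


class Row(NamedTuple):
    receiver: str                      # which of our servers saw the attempt
    ip: str                            # connecting host
    recipient: str
    sender: str
    initial_attempt: datetime
    last_success: Optional[datetime]   # None means it never got through


def retrying_hosts(rows: Iterable[Row], now: datetime,
                   min_tuples: int = 5) -> Set[str]:
    """Hosts with enough distinct successful quadlets and no recent failures."""
    delivered = defaultdict(set)   # ip -> distinct successful tuples
    failed = set()                 # ips with a failure in the 48h..7d window
    for row in rows:
        if row.sender == "":
            continue               # ignore bounces
        if row.last_success is not None:
            delivered[row.ip].add((row.receiver, row.recipient, row.sender))
            continue
        age = now - row.initial_attempt
        # Younger failures may still be retried; older ones are ignored.
        if timedelta(hours=48) <= age <= timedelta(days=7):
            failed.add(row.ip)
    return {ip for ip, tuples in delivered.items()
            if len(tuples) >= min_tuples and ip not in failed}
```

The 48 hour lower bound gives slow senders time to retry before a missing success counts against them.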

These 1900 hosts were selected from almost 17000 unique hosts that connected to my mail servers.  Only around 4000 of these actually had failures on the greylisting.  The 1900-odd servers are the only ones to have made more than 5 deliveries; the other ~12600 hosts averaged 1.35 deliveries per host to date (slightly over a week).

I expect the list of 1900 hosts to grow a little more in the coming weeks, and after that probably very, very slowly.  Since this list went active all the complaints have died down, and the SPAM in my inbox has dropped back to a message every other day.  Based on the statistics I’ve gathered from the mail_greylist table, around 40 % of the triplets over the first weekend failed, and from the above we can say that 22 % of incoming hosts were spammers.  And that’s _after_ a few other checks that, based on my logs, already rejected around 85 % of incoming mail.  So this should bring us up close to 90 % of mail being rejected/dropped even before looking at the message body.  That is pretty darn scary.
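
As a rough check on that last figure, the percentages can be combined as below.  This assumes (and it is only an assumption) that the share of triplets or hosts failing greylisting is a fair stand-in for the share of messages it stops, and that greylisting only sees what the earlier checks let through.

```python
def combined_rejection(earlier: float, greylist: float) -> float:
    """Overall rejected fraction when greylisting runs after the earlier checks."""
    return earlier + (1.0 - earlier) * greylist


if __name__ == "__main__":
    # 85 % rejected by earlier checks, then 40 % of triplets failing greylisting.
    print(f"{combined_rejection(0.85, 0.40):.1%}")
    # Same, but using the 22 % of hosts figure instead.
    print(f"{combined_rejection(0.85, 0.22):.1%}")
```

The two estimates come out at roughly 91 % and 88 %, which is where the “close to 90 %” comes from.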

2 Responses to “Getting a grip on greylisting”

  1. kdcoetzee says:

    For a week or 2 ALL mail problems were, in the users’/techies’ eyes, related to grey listing. But now that the grey listing has learnt and added the majority of mail servers, the word grey listing is quietly fading from the users’ vocabulary, because the mail is as it should be: “moer vinnig” (damn fast) and reliable, and no cheap Viagra.

  2. This is an interesting post.
