Email Archiving

So recently this topic came up again in the office, and with clients, and I came to realize exactly how sticky this problem really is. The requirements companies generally have are something along the lines of:

We want all email to be archived, and we don’t want anybody to have access to it, not even the mail admins. And yet we want the archives to be available on demand.

The irony is that your email admins can probably already do significantly more damage than you think. For example, it’s dead easy for an admin to BCC all of your CEO’s incoming email to him/herself.

So from the outset there are legal issues surrounding email archiving: when are you allowed to archive (monitor) and when not? Who’s allowed to have access to these archives and who not? To what extent do your policies cover your proverbial legal ass, and to what extent does your archive solution need to be immune from its administrators (without hampering their ability to perform their work)? These types of questions are, strictly speaking, not even technical, and trust me, when it comes to legalese I’m the last person that should be asked about these things.

I prefer the technical side of this challenge. And when it comes to email archiving there are a few technical issues.

First and foremost, there is the issue of actually making copies. This I fortunately managed to wangle based on a post which can be located here. It describes something similar to what I need, but not quite the same thing, so I took the system_filter from there and tampered with it a little. Basically I need to have all the emails on the system, but I do NOT want multiple copies of the same email (e.g. if someone sends to Kevin, Stephen and myself, I still only want one copy of the email in the archive).

So technical issue number one is easily solved, by creating a simple transport for dropping emails in a Maildir/ location (duplicated from another due to slight variances in settings, but this should look somewhat familiar for most):

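archive_delivery:
  driver = appendfile
  # No file/directory option here: the path comes from the save
  # command in the system filter.
  maildir_format
  delivery_date_add
  user = mail
  group = mail
  # Archived messages are readable by the mail user only.
  mode = 0600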

And then two global settings:

system_filter = /etc/exim/system_filter.exim
system_filter_directory_transport = archive_delivery

And the system_filter.exim file is reasonably obvious (and subject to change):

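# Exim filter
# Archive each message once, on its first delivery attempt only.
if first_delivery
then
  # An empty sender address means a bounce (NULL return path).
  if ${sender_address} is ""
  then
    unseen save /var/spool/mail/archive/.${length_7:$tod_log}.BOUNCES/
  else
    # Folder name: year-month, then the lowercased sender with "." turned into "_".
    # Only "." is translated; other characters such as "/" pass through unchanged.
    unseen save /var/spool/mail/archive/.${length_7:$tod_log}.${tr{${lc:${sender_address}}}{.}{_}}/
  endif
endif

finish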

This still has many problems, as will be discussed shortly, but essentially what this does is create a maildir-format archive with sub-folders named after the year and month of the email, followed by the return path of the email. It uses BOUNCES in the case of the NULL return path. Yes, bounces too should be archived :).
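To make the folder naming concrete, here is a small Python sketch that mimics what the two save lines in the filter expand to. It is only an illustration: Exim does this expansion itself, and the function name archive_folder, the example address and the date are all made up for the example.

```python
from datetime import datetime

ARCHIVE_ROOT = "/var/spool/mail/archive"  # same root as in the filter


def archive_folder(sender_address: str, when: datetime) -> str:
    """Return the Maildir folder the filter would save a message into."""
    # ${length_7:$tod_log} keeps the first 7 characters of the log
    # timestamp, i.e. "YYYY-MM".
    month = when.strftime("%Y-%m")
    if sender_address == "":
        # NULL return path: the message is a bounce.
        name = "BOUNCES"
    else:
        # ${tr{${lc:...}}{.}{_}}: lowercase, then turn "." into "_".
        name = sender_address.lower().replace(".", "_")
    return f"{ARCHIVE_ROOT}/.{month}.{name}/"


# Hypothetical sender and date, purely for illustration.
print(archive_folder("Kevin.Smith@Example.com", datetime(2008, 5, 17)))
# /var/spool/mail/archive/.2008-05.kevin_smith@example_com/
print(archive_folder("", datetime(2008, 5, 17)))
# /var/spool/mail/archive/.2008-05.BOUNCES/
```

Since "." is the Maildir++ folder separator, replacing it in the address keeps each sender in a single folder under its month.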

The easy-to-spot problems are:

  1. What if you have a user called archive on your system? (In my case not a problem, as I store email in /var/spool/mail/${first_two_chars_of_username}/${username}.)
  2. How do you know who the recipients of the email were?
  3. How do you quickly (and accurately) locate emails with specific subjects/recipients? (Senders are easy in this case.)
  4. How do you restrict disk storage requirements? (A 1MB attachment typically consumes about 1.33MB once encoded in base64 for email transmission, a little more once line breaks are counted, not to mention that many such attachments are sent in several emails via forwards etc …)
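The base64 overhead in point 4 is easy to check with a few lines of Python. This is a back-of-the-envelope sketch: it assumes the usual 76-character encoded lines ending in CRLF and ignores the MIME headers around the part. base64_size is just an illustrative helper name.

```python
def base64_size(raw_bytes: int, line_length: int = 76) -> int:
    """Bytes needed for raw_bytes of data after MIME base64 encoding."""
    # Every 3 input bytes become 4 output characters (padded at the end).
    encoded = 4 * ((raw_bytes + 2) // 3)
    # Encoded text is wrapped into lines, each ending in CRLF (2 bytes).
    lines = (encoded + line_length - 1) // line_length
    return encoded + 2 * lines


one_mb = 1024 * 1024
print(round(base64_size(one_mb) / one_mb, 2))       # 1.37
print(round(base64_size(3 * one_mb) / one_mb, 1))   # 4.1 (MB for a 3MB file)
```

So the encoding alone adds a third, and the line breaks push it to roughly 37% on top of the raw size.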

There are a few ideas around this. In my exim ACL for verifying rcpt addresses I can build up an X-Archive-Recipients header, which I can obviously remove after the system filter has run but before delivery to the recipients happens. This will solve issue two above, but it is still to be done. It doesn’t get us quick access though: searching large amounts of text stored in files is SLOW.
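Once such a header exists, pulling the recipients back out of an archived message should be straightforward. The Python sketch below assumes the (not yet implemented) X-Archive-Recipients header holds a comma-separated list of addresses. That format, the function name archive_recipients and the sample message are assumptions for illustration only.

```python
from email import message_from_bytes
from email.utils import getaddresses


def archive_recipients(raw_message: bytes) -> list[str]:
    """Return the envelope recipients recorded in X-Archive-Recipients."""
    msg = message_from_bytes(raw_message)
    # The header may occur more than once, so collect every instance.
    values = msg.get_all("X-Archive-Recipients", [])
    return [addr.lower() for _name, addr in getaddresses(values) if addr]


# Hypothetical archived message: the To: header says nothing useful,
# but the archive header records who actually received the mail.
sample = (
    b"X-Archive-Recipients: Kevin@example.com, stephen@example.com\r\n"
    b"To: undisclosed-recipients:;\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"body\r\n"
)
print(archive_recipients(sample))
# ['kevin@example.com', 'stephen@example.com']
```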

This slots in with the issue of quick access to emails. In the long term, not storing meta information about the emails in some kind of indexed database is out of the question. Courier-IMAP also varies filenames based on various status bits, so even keeping track of the files becomes a problem (my instinctive response: just don’t give access to the archive via Courier-IMAP). However, there is a much bigger reason for not storing the raw emails in maildir format: duplicate attachments.

(Note to readers: The following paragraph may be offensive to the guilty persons)

Consider quickly what happens when secretary X receives a 3MB jpeg image from her girl-friend on the other side of the globe. In all probability she goes “ah, this is so cute”, hits the forward button, copies half her address book into the recipient list and hits send, spamming probably another 20 to 40 people with the 3MB image (which consumes 4MB once encoded for transport). Five minutes later one of the victims (same company) becomes another offender and also hits the forward button (by this time I’d love to get hold of the email if I was a spammer, as there would be somewhere between 40 and a few hundred most likely legit email addresses embedded in it). Assuming it stops there, this 3MB image has now been through the server three times, for a total of 9MB if we stored the image in its native “raw binary” format, or 12MB with base64 encoding (typical in Maildir).

So by not using Maildir/ we can potentially save 9MB worth of storage for the above scenario (one raw 3MB copy of the image instead of three base64-encoded copies totalling 12MB). The process thus needs to break an email into its various MIME parts, whilst maintaining the MIME container structure as well as the individual parts’ headers. Additionally, someone along the way may decide to rename the image, or even just use a different character set for the headers supplying the filename. Thus if we do intend to de-duplicate the emails we really ought to work on content hashes instead of things like filename and size, which are somewhat unreliable (not all MIME parts have file names, and often the MIME headers are badly broken; I recall an instance of email sent by a well-known accounting package).
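As a rough idea of what that could look like, here is a minimal Python sketch using the standard email package. It keeps the MIME tree and every part’s headers, and stores each decoded body once under its SHA-256 hash. The names describe and make_mail, the in-memory dict standing in for a blob store, and the choice of SHA-256 are all assumptions of this example, and it makes no attempt at byte-for-byte reconstruction of the original message.

```python
import hashlib
from email import message_from_bytes
from email.message import EmailMessage, Message


def describe(part: Message, blobs: dict[str, bytes]) -> dict:
    """Split a MIME part into structure + headers, storing bodies by hash."""
    node = {"headers": part.items()}  # keep every header of this part
    if part.is_multipart():
        # Container: recurse so the nesting of the parts is preserved.
        node["children"] = [describe(c, blobs) for c in part.get_payload()]
    else:
        # Leaf: undo base64/quoted-printable and hash the raw content.
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha256(payload).hexdigest()
        blobs.setdefault(digest, payload)  # stored once per distinct content
        node["sha256"] = digest
    return node


def make_mail(filename: str) -> bytes:
    """Build a made-up test message carrying the same fake image."""
    m = EmailMessage()
    m["From"] = "x@example.com"
    m["To"] = "y@example.com"
    m["Subject"] = "so cute"
    m.set_content("so cute")
    m.add_attachment(b"\xff\xd8fake-jpeg-bytes", maintype="image",
                     subtype="jpeg", filename=filename)
    return m.as_bytes()


blobs: dict[str, bytes] = {}
first = describe(message_from_bytes(make_mail("cat.jpg")), blobs)
second = describe(message_from_bytes(make_mail("kitten.jpg")), blobs)
# Same image under two filenames: one text blob and one image blob in total.
print(len(blobs))
```

The renamed attachment hashes to the same digest, so only the per-message headers (including the differing filename) are stored twice.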

Another advantage of actually “smartly” processing the email is that we can keep a database of emails, dates, senders (return path + From + Sender), recipients (not always the same as the To: and Cc: headers) and, whilst we’re at it, the subject lines. This information alone would very quickly become huge, and in all probability we would want a full-text search on the Subject: lines. This is starting to sound like a relational database of sorts if you ask me, but at the same time, storing the content itself doesn’t.
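For the relational half, a sketch of such a metadata index could look like the following. It uses SQLite via Python’s sqlite3 module purely because it needs no server. The table and column names, the helper functions and the sample row are all invented for this example, and subject search is left out (a real setup would want a proper full-text index for that).

```python
import sqlite3

SCHEMA = """
CREATE TABLE message (
    id            INTEGER PRIMARY KEY,
    received      TEXT NOT NULL,   -- ISO 8601 timestamp
    return_path   TEXT NOT NULL,   -- envelope sender, '' for bounces
    from_header   TEXT,
    sender_header TEXT,
    subject       TEXT
);
CREATE TABLE recipient (
    message_id INTEGER NOT NULL REFERENCES message(id),
    address    TEXT NOT NULL       -- envelope recipient, lowercased
);
CREATE INDEX recipient_address ON recipient(address);
CREATE INDEX message_received ON message(received);
"""


def open_index(path=":memory:"):
    """Open (and initialise) the index; defaults to a throwaway database."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_message(db, received, return_path, from_header, sender_header,
                subject, recipients):
    """Record one archived message and its envelope recipients."""
    cur = db.execute(
        "INSERT INTO message (received, return_path, from_header,"
        " sender_header, subject) VALUES (?, ?, ?, ?, ?)",
        (received, return_path, from_header, sender_header, subject))
    db.executemany("INSERT INTO recipient VALUES (?, ?)",
                   [(cur.lastrowid, r.lower()) for r in recipients])
    db.commit()
    return cur.lastrowid


def find_by_recipient(db, address):
    """All messages delivered to address, oldest first."""
    return db.execute(
        "SELECT m.id, m.received, m.subject FROM message m"
        " JOIN recipient r ON r.message_id = m.id"
        " WHERE r.address = ? ORDER BY m.received",
        (address.lower(),)).fetchall()


db = open_index()
add_message(db, "2008-05-17T10:21:32", "x@example.com", "X <x@example.com>",
            None, "so cute", ["Kevin@example.com", "stephen@example.com"])
print(find_by_recipient(db, "kevin@example.com"))
```

Recipients get their own table because one message can have many of them, and the index on address is what makes lookups by recipient quick.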

And that’s it for the moment. More ideas later. For now, you can build a very simple archive of all email passing through your exim mail system and do a bit of grepping, and unless you have a very busy system, the above may well be good enough for you.
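If plain grep gets too noisy, a slightly smarter scan that only looks at headers is not much work either. The Python sketch below walks the archive and returns the files whose Subject: contains a given string. The function name grep_subject is made up, and it assumes the standard Maildir layout with messages in new/ and cur/. It does not decode RFC 2047 encoded subjects and is still a linear scan, so it will be slow on a big archive.

```python
import os
from email.parser import BytesParser


def grep_subject(archive_root: str, needle: str) -> list[str]:
    """Return paths of archived messages whose Subject contains needle."""
    parser = BytesParser()
    hits = []
    for dirpath, _dirs, files in os.walk(archive_root):
        # Maildir keeps delivered messages in new/ and cur/ only.
        if os.path.basename(dirpath) not in ("new", "cur"):
            continue
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fp:
                # Parse the headers only; bodies can be large.
                msg = parser.parse(fp, headersonly=True)
            subject = str(msg.get("Subject", ""))
            if needle.lower() in subject.lower():
                hits.append(path)
    return sorted(hits)
```

Something like grep_subject("/var/spool/mail/archive", "invoice") would then list the matching message files.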
