GlusterFS and courier-imap

On Friday evening we had a scheduled power outage for maintenance in one of our main data-centers. Since we’re running GlusterFS on two servers, replicating between each other, everything is supposed to simply fail over to the other server and continue on its merry way – everything is configured redundantly. Unfortunately courier-imap seems to trigger a use case which GlusterFS cannot properly deal with.

The way maildir works, updates are generated in the tmp/ folder and then linked into the destination folder before being unlinked from tmp/. So let’s say we need to update a file cur/foo; then the following happens:

link(cur/foo, tmp/foo);
unlink(cur/foo);
/* ... make changes to tmp/foo ... */
link(tmp/foo, cur/foo);
unlink(tmp/foo);

The link()/unlink() sequence can also be a single rename(), which is supposed to do the same thing atomically.
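For illustration, here is a minimal C sketch of the rename() variant; the file names are made up, and real maildir code generates unique tmp/ names and handles errors far more carefully:

 #include <stdio.h>
 #include <unistd.h>

 /* Sketch of the rename() variant of a maildir update: write the
  * new content into tmp/, then atomically move it over the old
  * file in cur/. */
 static int update_file(void)
 {
     FILE *fp=fopen("tmp/foo", "w");

     if (fp == NULL)
         return -1;

     /* ... write the new content to tmp/foo ... */

     if (fclose(fp))
     {
         unlink("tmp/foo");
         return -1;
     }

     /* One atomic step: cur/foo now names the new inode, and the
      * old inode it referred to is unlinked. */
     if (rename("tmp/foo", "cur/foo") < 0)
     {
         unlink("tmp/foo");
         return -1;
     }
     return 0;
 }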

On top of GlusterFS, if tmp/foo gets newly created, you end up with a new GFID for the file. If one of the bricks is down at the time, you end up with the same file having two different GFIDs when that brick returns, which is an impossible-to-solve split-brain for the containing directory. See https://bugzilla.redhat.com/show_bug.cgi?id=1366849, filed after a previous discussion I had on the matter with Joe Julian – probably one of the most knowledgeable GlusterFS users.

The problem: now consider this scenario. The basic configuration is a pure replicate (two bricks). While both bricks are up:

rm some_file    # results in both bricks removing the file
touch some_file # results in both bricks creating the file (same gfid).

Now, assume that some_file exists on both bricks with the same GFID.

# brick goes down (power failure)
rm some_file    # only the "up" brick unlinks the file (along with the original GFID).
touch some_file # only the "up" brick creates the file (along with a new GFID).
# downed brick restores.

At this point some_file exists on both bricks with different GFID values. So when you try to access the file, how should GlusterFS know which one is correct? The answer, obviously, is that it can’t. For a human being the choice is obvious; GlusterFS has no automatic way to make it.
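On a plain local filesystem the same effect shows up in the inode number, which is the closest local analogue to a GFID. A quick sketch demonstrating that rm plus touch yields a brand new file identity:

 #include <stdio.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/stat.h>

 int main(void)
 {
     struct stat before, after;

     close(open("some_file", O_CREAT|O_WRONLY, 0644));
     stat("some_file", &before);

     unlink("some_file");                                /* rm some_file    */
     close(open("some_file", O_CREAT|O_WRONLY, 0644));   /* touch some_file */
     stat("some_file", &after);

     /* The inode numbers differ: to the filesystem this is a brand
      * new file, just as the recreated file gets a brand new GFID
      * on the brick that stayed up. */
     printf("inode before: %lu, after: %lu\n",
         (unsigned long)before.st_ino, (unsigned long)after.st_ino);
     return 0;
 }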

After a quick look at the code (the file causing most of my problems is courierpop3dsizelist – and all of them are courier-imap specific files), I found this function:

 static int savepop3dlist(struct msglist **a, size_t cnt,
              unsigned long uid)
 {  
    FILE *fp;
    size_t i;
 
    struct maildir_tmpcreate_info createInfo;
 
    maildir_tmpcreate_init(&createInfo);
 
    createInfo.uniq="pop3";
    createInfo.doordie=1;
 
    if ((fp=maildir_tmpcreate_fp(&createInfo)) == NULL)
    {   
        maildir_tmpcreate_free(&createInfo);
        return -1;
    }

That code creates a new file under tmp/ – which is soon to be linked into the original courierpop3dsizelist’s place.

    fprintf(fp, "/2 %lu %lu\n", uid, uidv);
 
    for (i=0; i<cnt; i++)
    {   
        char *p=a[i]->filename;
        char *q;
 
        if ((q=strrchr(p, '/')) != NULL)
            p=q+1;
 
        fprintf(fp, "%s %lu %lu:%lu\n", p, (unsigned long)a[i]->size,
            a[i]->uid.n, a[i]->uid.uidv);
    }

That loop generates the content for the file under tmp/.

    if (fflush(fp) || ferror(fp))
    {   
        fclose(fp);
        unlink(createInfo.tmpname);
        maildir_tmpcreate_free(&createInfo);
        return -1;
    }

If there were any errors, abandon the file under tmp/ and return an error ourselves.

    if (fclose(fp) ||
        rename(createInfo.tmpname, POP3DLIST) < 0)
    {   
        unlink(createInfo.tmpname);
        maildir_tmpcreate_free(&createInfo);
        return -1;
    }

Close the file, and rename it (replacing the existing courierpop3dsizelist). This has the effect of removing the old file (on one brick only, when the other is down) and replacing it with a new file (and a new GFID), resulting in the split-brain described above. Apparently one system call is all it takes.

    maildir_tmpcreate_free(&createInfo);
    return 0;
 }

Cleanup. I haven’t verified the code paths for the various other courier* info files, but I’d bet they follow similar file access patterns.

What can be done to fix this situation? I’ve got some ideas but, alas, I’m not 100% sure, since I’m not that familiar with the GlusterFS code base and its underlying workings. In the bug report above I mentioned that I’ve seen files under .glusterfs/ with a link count of 1. This is significant because it’s never supposed to happen. Ever. The most likely cause is someone unlinking the associated file directly on the brick itself.

But what if we were to cause it on purpose, specifically to solve the case above? It would have saved me approximately 1500 manual fixes in the last few hours alone. Consider the some_file example again, using a shortened GFID of 112233 and assuming two bricks, A and B. The way GlusterFS stores this file is as some_file relative to the brick location, with an extended attribute recording the GFID, as well as a hard link to .glusterfs/11/22/112233, also relative to the brick location.

Now, normally, when both bricks are up and we unlink some_file, the link count for .glusterfs/11/22/112233 drops to 1 and GlusterFS can safely unlink that file too. This, as I understand it, is the normal operation.
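This layout is easy to verify directly on a brick by stat()ing both names and comparing. A minimal sketch, using the shortened GFID from the example and a made-up brick path:

 #include <stdio.h>
 #include <sys/stat.h>

 int main(void)
 {
     struct stat named, gfid;

     /* The brick path and shortened GFID are examples only. */
     if (stat("/bricks/a/some_file", &named) != 0 ||
         stat("/bricks/a/.glusterfs/11/22/112233", &gfid) != 0)
     {
         perror("stat");
         return 1;
     }

     /* Same inode plus st_nlink == 2 means the two names are hard
      * links to the same file. A GFID file with st_nlink == 1 is
      * the orphaned case discussed above. */
     printf("same inode: %s, gfid link count: %lu\n",
         named.st_ino == gfid.st_ino ? "yes" : "no",
         (unsigned long)gfid.st_nlink);
     return 0;
 }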

But what if brick B is down – why can’t we, on brick A (which is up), mark that GFID for healing and only unlink some_file? Once brick B returns we would have the following files (assuming the new GFID for the newly created file is aabbcc):

 # On brick A:
 some_file (attr gfid: aabbcc)
 .glusterfs/aa/bb/aabbcc (link count 2, with some_file as the other hard link, marked for healing)
 .glusterfs/11/22/112233 (link count 1, marked for healing)
 
 # On brick B:
 some_file (attr gfid: 112233)
 .glusterfs/11/22/112233 (link count 2, with some_file as the other hard link).

I’m not sure how GlusterFS handles files which exist but for which the .glusterfs (GFID) file does not (i.e. if we were to rm .glusterfs/11/22/112233 on brick B above). So there really are two ways to approach this, depending on what happens:

If some_file will then eventually be removed (it needs to heal from the other brick), then when the self-heal daemon on brick A sees this, we can simply unlink .glusterfs/11/22/112233 on both bricks (given that its link count is 2 on B and 1 on A). The result is that when some_file gets opened (stat or whatever), GlusterFS discards the copy on brick B, heals brick B from brick A, and we continue on.

If that is not the case, it becomes more involved. We would have to note, when a process tries to open some_file, that it’s in split-brain (this already happens, due to the mismatching GFID values). But what if at that point we checked the link count of the old GFID file on both bricks, realized that it’s 1 on brick A, and concluded that (since that GFID file still exists on A, orphaned) the file must have been deliberately replaced on A, making A’s copy the correct one? Discard some_file on B and heal from A. At that point the link count of .glusterfs/11/22/112233 drops to 1 on both bricks and it can safely be removed.
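A rough sketch of that second decision, assuming the self-heal logic can read the link count of the old GFID file on each brick – all names here are invented for illustration, not actual GlusterFS code:

 enum heal_source { HEAL_FROM_A, HEAL_FROM_B, HEAL_UNDECIDED };

 /* Proposed heuristic: the brick where the old GFID file
  * (.glusterfs/11/22/112233) has link count 1 is the one that
  * deliberately unlinked/replaced the named file, so it holds
  * the newer copy and should be the heal source. */
 static enum heal_source pick_source(unsigned nlink_old_gfid_a,
                   unsigned nlink_old_gfid_b)
 {
     if (nlink_old_gfid_a == 1 && nlink_old_gfid_b == 2)
         return HEAL_FROM_A;
     if (nlink_old_gfid_b == 1 && nlink_old_gfid_a == 2)
         return HEAL_FROM_B;

     /* Anything else (both 1, both 2, or counts above 2 due to
      * extra hard links) needs the corner-case checks below. */
     return HEAL_UNDECIDED;
 }

 int main(void)
 {
     /* The scenario from the text: 112233 orphaned (nlink 1) on
      * brick A, still intact (nlink 2) on brick B: heal from A. */
     return pick_source(1, 2) == HEAL_FROM_A ? 0 : 1;
 }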

There are unfortunately some corner cases here. We would have to check that the GFID files are in fact NOT split; in other words, their metadata and content need to match. Failing that, it’s still possible that a modification of the file on B happened without A’s knowledge, due to a double-outage sequence (A and B are up; B goes down; A makes the rename; A goes down; B comes up; B modifies the file in place; A restores).

I also suspect it becomes more involved in cases where the link count of the file is greater than 2, i.e. where some_file is a hard link to another_file – in this case the link count on each brick would be 3 (counting the extra .glusterfs GFID file). Again taking the above case, we’d end up with:

 # On brick A:
 some_file (attr gfid: aabbcc)
 another_file (attr gfid: 112233)
 .glusterfs/aa/bb/aabbcc (link count 2, with some_file as the other hard link)
 .glusterfs/11/22/112233 (link count 2, with another_file as the other hard link, possibly marked for healing)
 
 # On brick B:
 some_file (attr gfid: 112233)
 another_file (attr gfid: 112233)
 .glusterfs/11/22/112233 (link count 3, with some_file and another_file as the other hard links).

In this case it’s not obvious what to do. Which one was the one that was deleted and replaced?

There are probably a lot of other corner cases not considered here either, and a simple solution is probably out of the question. In the above case I can reason that the 112233 GFID exists on both bricks but the aabbcc GFID does not, therefore the aabbcc GFID is the correct one: the implication is that the file had to have existed on both servers before (hopefully in sync), so it must have been unlinked and replaced on A for that particular filename. But how can we know it wasn’t a sequence like this:

 another_file exists on both bricks (gfid 112233)
 # A goes down
 B hard links another_file to some_file
 # B goes down
 # A comes back
 A creates a new file some_file (gfid aabbcc)
 # B restores

Both GFIDs would be marked for heal, and there is no obvious (automatic) solution. The only clue is that GFID aabbcc exists on A but not on B – which is probably indicative that A’s copy is the correct one.
