GlusterFS experiences

Recently we’ve had some unexpectedly bad performance on a few servers that rely on GlusterFS. Specifically, our web servers (the Apache + PHP portion of our LAMP stack) took quite a beating, with more outages than I’d care to admit.

As it turns out, the problems we had were not actually caused by GlusterFS but by a kernel issue. Even so, I learned quite a bit about GlusterFS and would like to take the opportunity to share some of my experiences.

The first time I used GlusterFS was way back on version 3.4.2. Needless to say, it’s come a very long way since then and I’m pleasantly surprised. Impressed, almost. Most of our clusters (we run two in-house, both 2 x pure replica, a 4 x replica for one customer who insists on staying operational even if three of their four nodes go down completely, and a few other 2 x 2 replica-distributes for other customers) are now on version 3.7.14. We haven’t yet made the jump to the 3.8 series, and that is unlikely to happen until at least a few minor versions in.

As things stand in the aftermath of the kernel problem, we picked up a few optimization “tricks”, some general advice on how to avoid split brain, and a few split-brain scenarios that can be tricky to figure out, along with some coding tricks to deal with them. This is likely to span multiple blog entries.

It’s not always GlusterFS

We had quite a bit of back and forth. We tracked down a “partially faulty” drive that would swallow requests; we replaced that and things improved. A lot. But it was still nowhere near perfect.

We tried to optimize GlusterFS using more information sources than I can count (never mind reference here). The underlying issue, though, was definitely a kernel problem: I only realized at a later stage that our problems actually started when we upgraded the kernel on one of our data nodes. We later upgraded the other node too, to make sure it wasn’t a known kernel problem that had already been fixed. It turned out to be a regression, and rolling back to the kernels we had pre-upgrade restored stability.

Along the way we moved Apache and PHP logging off to an on-server partition (we normally log from multiple servers to the same log file, which makes life easier with regards to access logs and error logs for customers). We will likely re-enable that during the course of next week, including processing of access logs using webalizer. Time sync is done using ntp, so out-of-order timestamps in the log files should not be a problem (we’ve never encountered any). The motivation? Well, less data flowing through FUSE and Gluster should result in better performance, right? Well, it did help. A little. Not much.
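
For what it’s worth, the change itself is nothing exotic: it is just a matter of pointing the log directives at local disk instead of at a path on the Gluster mount. A minimal sketch for one vhost (paths and names are hypothetical, not our actual config, and the “combined” log format needs to be defined as usual):

# before: shared logs on the Gluster mount (hypothetical paths)
#   ErrorLog  /mnt/gluster/logs/example.com-error.log
#   CustomLog /mnt/gluster/logs/example.com-access.log combined
# after: local disk
ErrorLog  /var/log/apache2/example.com-error.log
CustomLog /var/log/apache2/example.com-access.log combined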

We suspect the problem relates to an alleged refactor of the mdraid code in the kernel around version 4.1. I say alleged because I can’t confirm the refactoring, however http://www.linuxfromscratch.org/blfs/view/svn/postlfs/mdadm.html also confirms a problem. There is an error there, though, in that we managed to deadlock a 4.6.2 kernel as well with a reshape on a RAID-6 set to add another drive. Simple consistency checks would also result in I/O blocking indefinitely. We initially tracked this problem on a server that does not run GlusterFS but does a load of rsync (the rsync network traffic exceeds 100mbps on a daily basis, with disk access of up to 400MB/s observed). This system would crawl to a halt and, as described, starve to the point of requiring a reboot.

As you can imagine – any kind of I/O deadlock is detrimental to the overall system.

As an example, the scenarios above caused more than one split brain. Consider this sequence of events:

  1. a client issues a write
  2. one of the bricks deadlocks and never returns
  3. eventually you reboot that server
  4. in the meantime the other server has starved too, also requiring a reboot (healing won’t happen in this state)
  5. now, whilst that server is rebooting, another write to the same file occurs on the original brick, which knows nothing of the file

Trivial enough.

Failure to heal – and split-brain scenarios

I’m not going to go into detail on these, but I am going to outline the scenarios we’ve seen.

Some of these will require some understanding of how GlusterFS stores files on the bricks. Specifically, you need to understand a little about gfid files. Joe Julian gives a good explanation on his blog. It boils down to this: each file has a cluster-wide unique UUID value, called a GFID. For each GFID present on the brick you should find an entry at .glusterfs/aa/bb/aabbccdd-…, where aa and bb are the first two pairs of hex digits of the GFID and the entry is named after the full GFID of the file/folder. In the case of folders this is a symlink to the actual folder; in the case of files it is a hardlink.
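
To make that concrete, here is roughly how you can map a file on a brick to its gfid entry (brick path and gfid value are hypothetical; run this on the brick itself, not on the client mount):

getfattr -n trusted.gfid -e hex /data/brick1/somedir/somefile
# trusted.gfid=0xaabbccddaabbccddeeffaabbccddeeff
#
# the matching hardlink then lives at:
stat /data/brick1/.glusterfs/aa/bb/aabbccdd-aabb-ccdd-eeff-aabbccddeeff
# for a regular file the link count here should be 2 or more; for a
# directory you would find a symlink back to the real folder instead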

Split-gfid on files

I call these split gfid due to the nature thereof. I wish I had some examples, but it boils down to a file having two different gfid values on different bricks. So let’s say /a has a GFID value of aabbccdd-aabb-ccdd-eeff-aabbccddeeff on one server and wwxxyyzz-bbaa-ddcc-ffee-ffeeddccbbaa on the other. In this case the GFID of the containing folder will report as being in split-brain and we’re stuck. When you do an ls on a client against this file you will get garbage – a bunch of ??? marks where you’d expect the uid, gid and permission values.

You need to trash one of the two files, including its gfid file (assuming that the link count drops to one after removing the file from the tree). Performing a stat via a client mount will immediately pick up that there is a problem and cause the file to be queued for healing. You can hasten the heal by running:

gluster volume heal ${volname}

Which you should follow up with:

gluster volume heal ${volname} info

To get progress updates. You can also use statistics instead of info to determine whether or not heals are ongoing. If a heal has just finished and there are still items listed, check the logs for why the heal didn’t happen, and fix it. A lingering entry in the heal list represents a split brain waiting to happen.
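
Putting the split-gfid cleanup together, a rough sketch of the sequence looks something like the below. The brick paths, mount point and gfid values are hypothetical, and you need to be absolutely sure which copy you are throwing away before you rm anything:

# on each brick, compare the gfid of the offending file
getfattr -n trusted.gfid -e hex /data/brick1/a        # on server A
getfattr -n trusted.gfid -e hex /data/brick1/a        # on server B

# on the brick whose copy you decided to discard, remove the file
# and its gfid hardlink (ww/xx taken from the gfid you read above)
rm /data/brick1/a
rm /data/brick1/.glusterfs/ww/xx/wwxxyyzz-bbaa-ddcc-ffee-ffeeddccbbaa

# then trigger and monitor the heal
stat /mnt/gluster/a
gluster volume heal ${volname}
gluster volume heal ${volname} info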

meta-data split

This particular type of split brain is a very interesting one, and I had to perform a few manual repairs even though the files, including ownership and permissions, were identical. In my case quite literally only the access and modification times differed. You could reason that we can simply use the newer of the two values, but it’s more complicated than that, because we need to VERIFY that the content is identical – and given that files can potentially be multiple terabytes in size … well, you get the problem. We do want to avoid reading the data if we can. Again, you can simply discard one of the two copies as per above (potentially resulting in a large copy over the network – so a way to use the newer values, all else being equal, would be a better solution).
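
For inspection, the afr changelog xattrs on the bricks tell you which copy blames which (paths and volume name below are hypothetical). The 3.7 series also grew policy-based split-brain resolution on the CLI, which can save you from deleting anything by hand, though exactly which policies are available depends on your version, so check what yours supports before relying on it:

# show the afr/gfid xattrs for the file on each brick
getfattr -d -m . -e hex /data/brick1/path/to/file

# policy-based resolution (file path is as seen from the volume root;
# availability of each policy depends on your exact version)
gluster volume heal ${volname} split-brain source-brick serverA:/data/brick1 /path/to/file
gluster volume heal ${volname} split-brain latest-mtime /path/to/file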

single-link GFID file

I’ve seen around 20 cases where gluster volume heal … info would give me a bare gfid entry in the heal list and, according to the log, fail to heal it with an unspecified error. Upon running stat on the gfid file under .glusterfs/ I noticed the link count was one, meaning the file really isn’t present in the filesystem tree any more. There are two types of resolution: relink it back into the tree, or delete it. Both options are simple. If you want to relink it, simply cp -a it back via a client mount to where it should go. Either way, you end up needing to rm the orphaned gfid file.
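
A quick way to spot these on a brick is to look for regular files under .glusterfs with a link count of one (brick path hypothetical). Note that gluster keeps a few housekeeping files of its own under .glusterfs, so eyeball the output before removing anything:

# regular files under .glusterfs whose link count has dropped to one
find /data/brick1/.glusterfs -type f -links 1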

Basic optimization

There are loads of options to tweak, and I’m inclined to think the list is growing. So to get a list of all options available to you, run this:

gluster volume get ${volname} all

This will output a large number of options, some of which have been touched on in other blogs already. I’m going to highlight only a handful.

Healing options:

cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off

The defaults here are all on. Switching the bottom three off actually improved our performance by a reasonable margin, and from what I can tell the only side effect is that healing will only happen via the self-heal daemon, which should always be running anyway. This has an impact on stat() calls, from what I understand.
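
Setting these is the usual gluster volume set incantation (the volume name is a placeholder; as always, test this before touching production):

gluster volume set ${volname} cluster.data-self-heal off
gluster volume set ${volname} cluster.entry-self-heal off
gluster volume set ${volname} cluster.metadata-self-heal off
# leave the daemon itself on
gluster volume set ${volname} cluster.self-heal-daemon on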

There are a bunch of performance options, and a lot of them can be tweaked. Larger caches should, in theory, result in better performance. We’ve had mixed success with the readdir options and eventually opted to set only the healing options above – i.e. no tweaking.

When running ls on HUGE folders it is sometimes a tad slow.
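
For completeness, these are the kind of knobs I mean. The values below are purely illustrative, not recommendations, and in our case the readdir-related ones gave mixed results:

# illustrative only: larger caches and more io threads
gluster volume set ${volname} performance.cache-size 256MB
gluster volume set ${volname} performance.io-thread-count 32
# readdir tuning we experimented with
gluster volume set ${volname} performance.readdir-ahead on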

There are a number of interesting options in that get list. I recommend you play with them in a development/test environment first.

Conclusion

If you are in need of a clustered filesystem I can definitely recommend GlusterFS as a possible solution. It obviously doesn’t give you “native disk” performance, and NFS outperforms it too (you did say you wanted no single point of failure, right?), but it has the advantage of being distributed and much more redundant (NFS generally represents a single point of failure).

In entries to follow I’ll hopefully explore some tips & tricks that could be useful for others too, including a few scripts and “functions” that I use for cleaning up split brains etc (they are in desperate need of some cleanup before they are ready for public consumption).
