bash, lockf and flock

So after having had enough I decided it’s time to make torpage integrate better with the portage locking. The current state isn’t bad, it’s just not complete. As things stand, if torpage initiates a fetch and portage then attempts to fetch the same file, portage will ignore the fact that torpage is busy downloading it and kick off a second, parallel download. What’s happening is that torpage (currently) treats the mere existence of the file as a lock, whereas portage takes out an fcntl (lockf) based lock on the file. Using the existence or non-existence of files as the basis for a lock is a bad idea anyway, so I decided it’s time I took what I learned in the last six years and fixed it.

So off I go, opening the portage code I looked at six years ago just to refresh my memory on how portage does the lock. It basically creates the file by opening it for read/write, then calls the lockf function in the fcntl module. Essentially it’s a fresh open so the file position is 0, and the len isn’t passed (it defaults to zero, meaning up to the end of the file, i.e. the whole file). It attempts to take an exclusive lock, albeit an advisory one (i.e. the kernel will keep track of the lock for us but it won’t actually enforce it, that’s what mandatory locks are for).
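
For reference, this is roughly what that pattern amounts to in Python. It is a minimal sketch of the steps described above, not portage’s actual code: the names take_lockf and try_lockf_in_child are mine, and the 0o644 file mode is an assumption.

```python
import fcntl
import os
import tempfile


def take_lockf(path):
    """Open path read/write (creating it if needed) and lock the whole file.

    Blocks while another process holds the lock. Returns the descriptor;
    the lock lasts until it is unlocked or the descriptor is closed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    # No length or start is passed, so the lock covers offset 0 to end of file.
    fcntl.lockf(fd, fcntl.LOCK_EX)
    return fd


def try_lockf_in_child(path):
    """Fork a child that attempts the same lock without blocking.

    Returns True if the child obtained the lock, False if it was refused.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # the child must never return into the caller
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0
```

While the descriptor returned by take_lockf stays open, a second process trying the same lock should be refused; once it is closed the lock is free again.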

In the kernel there are two locking interfaces, fcntl (with operation F_SETLK) and flock. The man pages present some cryptic information (flock(2), fcntl(2), lockf(3)). So I go off and read the Documentation/filesystems/locks.txt and mandatory-locking.txt files in the kernel sources. This reveals that flock was once upon a time (kernels older than 1.3.x from the looks of it) implemented on top of fcntl. In 1.3.x the flock emulation code was swapped out for a proper flock implementation (compatible with BSD). The problem now is that the two ignore each other, completely. But that, from the looks of it, seems to be the intended behaviour.
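
The “ignore each other” part is easy to check from Python, which exposes both calls in its fcntl module. A rough sketch, assuming Linux and a local filesystem (network filesystems such as NFS may behave differently); the helper names child_can_lock and demo are made up for this test.

```python
import fcntl
import os
import tempfile


def child_can_lock(path, lock_func):
    """Fork a child that opens path itself and tries a non-blocking
    exclusive lock with lock_func (fcntl.flock or fcntl.lockf).

    Returns True if the child got the lock.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            lock_func(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # the child must never return into the caller
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def demo():
    """Hold a flock lock, then see which kind of lock another process can get."""
    fd, path = tempfile.mkstemp()
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        return {
            "flock": child_can_lock(path, fcntl.flock),
            "lockf": child_can_lock(path, fcntl.lockf),
        }
    finally:
        os.close(fd)
        os.unlink(path)
```

Under those assumptions the second flock is refused while the lockf attempt goes straight through, which is exactly the situation torpage and portage would end up in.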

Now, looking at flock(1) you will note that bash gives us flock semantics (util-linux-ng @ sftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/, strictly speaking). Which means a snippet like:

exec 4>/usr/portage/distfiles/.fneh.tar.gz.portage_lockfile
trap "rm '/usr/portage/distfiles/.fneh.tar.gz.portage_lockfile'" EXIT
flock -x 4

Will not cause portage to block until we terminate. Quite the contrary, portage will still just proceed to grab the lockf version and continue on its merry way.

Some (ok, not just some, a lot actually) googling finds us http://rpm.pbone.net/index.php3/stat/4/idpl/5182919/dir//com/bash-builtin-lockf-0.2-alt1.i586.rpm.html – which based on the name looks really, really promising. Until you download the source RPM, and examination reveals that this actually looks like the current bash built-in flock, version 0.1-beta1. Except the package is all wrong from the looks of it. This essentially turns flock into a bash builtin.

This leaves me with some possible solutions:

1. Learn some more python, hook into the portage package and use their code to base torpage on top of.
2. Take the referenced code above as a basis for writing a lockf variant (shouldn’t be too hard).
3. Take whatever partition houses /usr/portage/distfiles/ and enable mandatory locking.
4. Alter flock(1) to add an option for using lockf instead of flock.

Option three looks like the simplest solution at first glance, however, it also won’t work any better than what I currently have. Basically once mandatory locking is enabled I thought of just trying to write one byte to the file – in the portage-has-the-lock case this will cause torpage to block, however, I still won’t be able to cause portage to block. Option 1 is probably the right thing to do but also the least attractive imho (almost complete rewrite of torpage). Alternatively I could invoke emerge/ebuild to fetch the files for the package. This however re-introduces a very old bug I had in portage. The files that need to be fetched vary based on architecture and USE flags. This is problematic. In other words, I’m not even sure whether option 1 is viable.

In essence it seems options 2 and/or 4 are the most feasible in the short term. The question between 2 and 4 is whether a single option to the existing flock is better, or whether one should rather write a separate utility. One should note that lockf(3) doesn’t differentiate between exclusive (write) and shared (read) locks: it only takes exclusive ones, even though the underlying fcntl F_SETLK can request either (F_WRLCK or F_RDLCK). For this reason a separate utility may be better; on the other hand, a flag -f (for fcntl) may well be a simpler solution, and may well be something that could be accepted upstream (given some motivation, which I don’t think will be too difficult considering the context of this discussion). Then again, cloning the utility and modifying it allows me to package it straight into torpage (until upstream merges and releases) as an immediate solution. The advantage of a simple patch is that all the other functionality (-o, -c etc …) comes for free even though I personally only really care about the file-descriptor based case.
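
When weighing a -f flag it helps to know what the fcntl layer itself can do. At that level the shared/exclusive distinction does exist, and Python’s fcntl.lockf (a wrapper around the fcntl locking calls, despite its name) exposes it through LOCK_SH and LOCK_EX. A small sketch to illustrate; child_try_lockf is a name I made up.

```python
import fcntl
import os
import tempfile


def child_try_lockf(path, lock_type):
    """Fork a child that opens path and tries a non-blocking fcntl-style lock.

    lock_type is fcntl.LOCK_SH (read lock) or fcntl.LOCK_EX (write lock).
    Returns True if the child obtained the lock.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, lock_type | fcntl.LOCK_NB)
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # the child must never return into the caller
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0
```

With a shared lock held by one process, a second shared lock should be granted while an exclusive one is refused, so an fcntl based option could in principle keep the -s and -x switches.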

Based on the code it looks like flock() on an fd applies to the group of processes sharing the file description. To be exact (ignoring errors):

flock(fd, LOCK_EX)

is equivalent to:

pid_t p = fork();

if (p == 0) {
     flock(fd, LOCK_EX);
     exit(0);
}

waitpid(p, NULL, 0);

Albeit crap slowly.
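
The same fork-and-wait idea can be sketched in Python to see whether the lock really sticks to the shared open file description. The helpers run_in_child and fresh_open_can_flock are names I made up for this sketch; it is a quick experiment, not a proof.

```python
import fcntl
import os
import tempfile


def run_in_child(action):
    """Fork, run action() in the child and wait for it.

    Returns True if action() finished without raising OSError.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            action()
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # the child must never return into the caller
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def fresh_open_can_flock(path):
    """Try a non-blocking flock through a brand-new open of path.

    A new open gets its own open file description, so this fails while
    a lock is held through any other description.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)  # closing also drops the lock if we got it
```

If a child takes the lock on an inherited descriptor with run_in_child(lambda: fcntl.flock(fd, fcntl.LOCK_EX)), the lock should still be held after the child has exited, and a second child calling LOCK_UN on that same descriptor should release it for everyone.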

This needs to be confirmed by testing, and then it needs to be confirmed that the fcntl based version has the same semantics. This is simpler than one would initially imagine. One can use bash to perform the initial testing against flock. Basically when bash executes a command it implicitly does the fork for us, so what we need is a command that has similar semantics to flock. In this case a program taking a single parameter (the file descriptor number), blocking for the lock, and returning 0 if, and only if, the lock was successfully obtained:

#include <stdio.h>     /* fprintf */
#include <stdlib.h>    /* strtol */
#include <sys/file.h>  /* flock, LOCK_EX */

int main(int argc, char **argv) {
    long int fd;
    char *eon;

    if (argc < 2) {
        fprintf(stderr, "USAGE: %s fd\n", *argv);
        return 2;
    }

    fd = strtol(argv[1], &eon, 0);
    if (*eon) {
        fprintf(stderr, "USAGE: %s fd\n", *argv);
        return 2;
    }

    if (flock(fd, LOCK_EX) == 0)
        return 0;
    return 1;
}

This just needs to be compiled using something like "gcc -o takelock takelock.c" (assuming the content is stored in takelock.c) and placed on a partition mounted with exec perms. Now, open up two terminals and in both execute the following command:

exec 4>/tmp/lockfile

This will open /tmp/lockfile for writing in both terminals, attached to file descriptor number 4. In the one terminal now execute "flock -x 4", which should terminate with exit code 0 (can be confirmed with echo $?). In the other you now need to execute our takelock executable (./takelock 4) which should block until you type "flock -u 4" in the other terminal. To release the lock in the takelock terminal you can also use flock -u 4.

Updating the code above to use lockf leaves us with this snippet:

#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* strtol */
#include <unistd.h>  /* lockf, F_LOCK, sleep */

int main(int argc, char **argv) {
    long int fd;
    char *eon;

    if (argc < 2) {
        fprintf(stderr, "USAGE: %s fd\n", *argv);
        return 2;
    }

    fd = strtol(argv[1], &eon, 0);
    if (*eon) {
        fprintf(stderr, "USAGE: %s fd\n", *argv);
        return 2;
    }

    if (lockf(fd, F_LOCK, 0) == 0)
        return 0;
    return 1;
}

Running this first in the one terminal and then in the other reveals that the semantics differ and that this doesn't work: the lock belongs to the takelock process rather than to the file descriptor, so it is gone again as soon as takelock exits. We'll be able to implement the -c semantics from the flock command, but not the fd semantics. Which is a shame really, but there are probably good reasons for it behaving this way. Again, this can be confirmed by adding a sleep(10); to the lockf if statement. Something like:

    if (lockf(fd, F_LOCK, 0) == 0) {
        fprintf(stderr, "Got lock!\n");
        sleep(10);
        return 0;
    }

Then run it in both terminals, about 5 seconds apart. If this still doesn't convince you, not much will.
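
For those who'd rather not juggle two terminals, the same experiment can be sketched in Python. Here helper_takes_lockf plays the role of "./takelock 4" and probe_finds_lock plays the role of the second terminal; all three function names are mine, and this is only a rough check of the behaviour described above.

```python
import fcntl
import os
import tempfile


def in_child(action):
    """Fork, run action() in the child, and report whether it succeeded."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            action()
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # the child must never return into the caller
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def helper_takes_lockf(fd):
    """A short-lived process takes a lockf lock on the inherited descriptor."""
    return in_child(lambda: fcntl.lockf(fd, fcntl.LOCK_EX))


def probe_finds_lock(path):
    """True if a separate process, with its own open, is refused the lock."""
    def attempt():
        fd = os.open(path, os.O_RDWR)
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return not in_child(attempt)
```

After helper_takes_lockf(fd) returns, probe_finds_lock should report that nothing holds the lock any more, whereas a lockf taken by the long-lived process itself stays visible to the probe.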

This still allows us to implement it as a bash builtin, or to rewrite portage to take the lock and have a sub-shell actually doing the download. Either way, that's more work than what I'm willing to do tonight.

Option 4 is thus eliminated (for the most part) seeing that lockf's semantics aren't similar enough. Option 2 may still be viable, but hard to motivate. The altlinux code suggests that one can modularly add bash built-ins - perhaps this is still an option, except that adding a global built-in is not a clean solution, so I'm in two minds as to whether or not to attempt it. A quick grep of the bash-4.0 source code however also suggests that altlinux hacked bash to add the required capabilities to load builtins as modules.

*cheers*. Here's to hoping that this helps someone else looking to understand the difference and nuances of these two rather confusing locking mechanisms.

2 Responses to “bash, lockf and flock”

  1. joe says:

    If you had only opted for option 4, I’d probably be using your code right now…

  2. Jaco Kroon says:

    Please read the entire post as to why this wasn’t feasible. Then I probably should have linked jkroon.blogs.uls.co.za/it/scriptingprogramming/bash-file-descriptors-pipes-and-lockf in there somewhere as that does contain a working solution.