vfork and O_CLOEXEC causes zfs_mount EBUSY
We can run into a problem where we call into zfs_mount, which in turn calls is_dir_empty, which opens the directory to try and make sure it's empty. The issue with the current approach is that it holds the directory open while it traverses it with readdir, which, due to subtle interaction with the Java JVM, vfork, and exec can cause a tricky race condition resulting in zfs_mount failures.
The approach to resolving the issue in this patch is to drop the usage of readdir altogether, and instead rely on the fact that ZFS stores the number of entries contained in a directory using the st_size field of the stat structure. Thus, if the directory in question is a ZFS directory, we can check to see if it's empty by calling stat() and inspecting the st_size field of structure returned.
The root cause appears to be an interesting race between vfork, exec, and zfs_mount's usage of O_CLOEXEC when calling openat. Here's what is going on:
1. We call zfs_mount, and this in turn calls openat to check if the directory is empty, which results in opening the directory we're trying to mount onto, and increment v_count.
2. As we're in the middle of reading the directory, vfork is called by the JVM and proceeds to exec the jspawnhelper utility. As a result of the vfork, we take an additional hold on the directory, which increments v_count a second time. The semantics of vfork mean the parent process will wait for the child process to exit or exec before the parent can continue; at this point the parent is in the middle of zfs_mount, reading the directory to determine if it's empty or not.
3. The child process exec-ing jspawnhelper gets to the relvm call within exec_args (which is called by exec_common). relvm is the function that releases the parent process, allowing the parent to proceed. The problem is, at this point of calling relvm, the child hasn't yet called close_exec which is responsible for closing the file descriptors inherited from the parent process that have the O_CLOEXEC flag set. Thus, now the parent is eligible to proceed executing zfs_mount, and the child still has a hold on the directory.
4. The parent wakes up, determines the directory is empty, drops it's hold on the directory and attempts to mount the filesystem. At this point, the child jspawnhelper process hasn't gotten to the close_exec function, and still has a hold on the directory. This extra hold is what is preventing the mount from succeeding in the parent process, so we get EBUSY returned back to the parent.