Gluster mount using backupvolfile-server fails on boot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
glusterfs (Ubuntu) | New | Undecided | Unassigned |
## Bug Description
Consider that we have an fstab entry that uses both volfile-server and backupvolfile-server, with mygluster as the primary server and mygluster-bak as the backup (other mount options elided):
mygluster:/mydir /var/mydir glusterfs backupvolfile-server=mygluster-bak,… 0 0
If the host mygluster is accessible at boot time, the mount succeeds. However, if mygluster is offline (because of a DNS error, for example) and mygluster-bak is online, the mount fails at boot time.
The bug only occurs at boot time. After the boot, if we run 'mount /var/mydir', the mount will work using the mygluster-bak server as expected.
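For illustration (these commands are not part of the original report), and assuming the outage is the DNS-error case mentioned above, a quick way to confirm the scenario before testing is to check that only the backup name resolves:
$ getent hosts mygluster || echo "mygluster does not resolve"
$ getent hosts mygluster-bak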
## How to reproduce
Put the following entry in your fstab:
mygluster:/mydir /var/mydir glusterfs backupvolfile-server=mygluster-bak,… 0 0
Mount the filesystem and check that the mount succeeded:
$ mount /var/mydir; mount | grep mydir; umount /var/mydir
mygluster:/mydir on /var/mydir type fuse.glusterfs (…)
Now reboot your system a few times and check that the mount sometimes fails. When it does, run 'mount /var/mydir' and the filesystem mounts successfully (a small helper to spot failed boots is sketched after the output below):
$ mount | grep mydir
$ mount /var/mydir; mount | grep mydir
mygluster:/mydir on /var/mydir type fuse.glusterfs (…)
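As a convenience (this helper is not part of the original report), a tiny script run at boot, e.g. from /etc/rc.local, can record whether the fstab mount came up, which makes the failing boots easy to spot:

#!/bin/sh
# hypothetical boot-time check: log whether /var/mydir was mounted during boot
if mountpoint -q /var/mydir; then
    echo "$(date) /var/mydir mounted at boot" >> /var/log/mydir-boot-check.log
else
    echo "$(date) /var/mydir NOT mounted at boot" >> /var/log/mydir-boot-check.log
fi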
### Logs
The boot.log, dmesg, the mountall log and the gluster client logfile (var-lib-…) are attached.
However, the only log that really helps is the gluster logfile, with entries like:
[glusterfsd
[name.
[fuse-
[glusterfsd
[fuse-
[xlator.
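For reference (the exact path below is an assumption, not from the report), these entries come from the GlusterFS fuse client log, which the client normally writes under /var/log/glusterfs/ with a name derived from the mount point (slashes replaced by dashes), so for /var/mydir something like:
$ tail -n 50 /var/log/glusterfs/var-mydir.log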
## Log analysis and debugging
The "Mountpoint /var/mydir seems to have a stale mount, run 'umount /var/mydir' and try again" log helps a lot.
I've changed the /sbin/mount.glusterfs script to add some debugging, and this is what happens during a failed boot (a simplified sketch of this flow follows the list):
- At first, mount.glusterfs runs: /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=mygluster … /var/mydir, i.e. the mount attempt against the primary server;
- Next, it runs 'stat -c %i /var/mydir' to test whether the inode is 1 (mount successful at this mount point) or some other number. In a normal mount attempt (running 'mount /var/mydir' after boot), this step returns a large number like 4198417. However, during boot, it returns no output and prints the following error to stderr: **stat: cannot stat ‘/var/mydir’: Transport endpoint is not connected**;
- In a second step, mount.glusterfs runs: /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=mygluster-bak … /var/mydir, i.e. the mount attempt against the backup server;
- Again, it runs 'stat -c %i /var/mydir' and gets the **Transport endpoint is not connected** error;
- At the end, mount.glusterfs prints "Mount failed. Please check the log file for more details.", runs "umount /var/mydir" and exits with status 1.
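To make the flow above easier to follow, here is a simplified sketch of the logic described in the list; it is an illustration only, not the exact upstream mount.glusterfs code (the helper name and elided options are assumptions):

#!/bin/sh
# illustrative sketch of the mount.glusterfs retry logic described above
try_mount() {
    /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server="$1" /var/mydir
    inode=$(stat -c %i /var/mydir 2>/dev/null)   # may fail with "Transport endpoint
    [ -z "$inode" ] && inode=0                   # is not connected" on a stale mount
    [ "$inode" -eq 1 ]                           # root inode 1 means the mount is live
}

if ! try_mount mygluster; then                   # first try: the volfile-server
    # this is the point where the workarounds below insert "umount /var/mydir" or "sleep 0.1"
    if ! try_mount mygluster-bak; then           # second try: the backupvolfile-server
        echo "Mount failed. Please check the log file for more details."
        umount /var/mydir > /dev/null 2>&1
        exit 1
    fi
fi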
So, I ran some tests to get more information about the "Transport endpoint is not connected" error and discovered that it occurs for a very short time after a mount error, so it's possible to hit it at any moment, not only at boot. The following command will sometimes reproduce the error (it's sporadic; a small polling loop to observe the window is sketched after the output):
$ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=… /var/mydir; stat -c %i /var/mydir
4198417
$ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=… /var/mydir; stat -c %i /var/mydir
stat: cannot stat ‘/var/mydir’: Transport endpoint is not connected
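To observe how long this window lasts (this probe is not part of the original report), one can poll the mount point right after a failed mount attempt:

# hypothetical probe: print the inode (or the stat error) ten times per second
for i in $(seq 1 50); do
    echo "$(date +%T.%N): $(stat -c %i /var/mydir 2>&1)"
    sleep 0.1
done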
To get even more debug info, I modified the mount.glusterfs script again to run 'fuser -m /var/mydir' after the first 'stat' failed with "Transport endpoint is not connected", in order to list any PIDs using the filesystem, and got:
$ fuser -m /var/mydir:
1 371 1280 1758 2287 2503
$ ps -ww -up 1 371 1280 1758 2287 2503:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 16.8 0.0 34156 3532 ? Ss 16:36 0:04 /sbin/init
root 371 0.1 0.0 20132 976 ? S 16:36 0:00 @sbin/plymouthd --mode=boot --attach-to-session --pid-file=
statd 1280 0.0 0.0 21540 1396 ? Ss 16:36 0:00 rpc.statd -L
syslog 1758 0.0 0.0 255840 1216 ? Ssl 16:36 0:00 rsyslogd
Unfortunately, I did not get any output for PIDs 2287 and 2503.
I don't know whether the second mount error is related to these PIDs returned by 'fuser' (or whether their presence is normal), or whether they are related to the "Transport endpoint is not connected" error, but it may be a starting point.
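For anyone repeating this, a one-shot capture (not part of the original report) can grab the PIDs holding the mount point and their command lines before short-lived processes, like 2287 and 2503 above, exit:

# hypothetical one-shot capture of the fuser PIDs plus their ps details
pids=$(fuser -m /var/mydir 2>/dev/null)
[ -n "$pids" ] && ps -ww -o user,pid,lstart,cmd -p "$(echo $pids | tr ' ' ',')"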
## Possible solutions and workaround
I looked into the "seems to have a stale mount, run 'umount ...' and try again" message and found this commit in the upstream code: https:/…
The commit message contains the sentence "Also, mount.glusterfs script unmounts mount-point on mount failure to prevent hung mounts". This refers to the umount line in mount.glusterfs: https:/…
I do not know whether running umount (as implemented after the last mount error, and as suggested in the gluster log) fixes the "Transport endpoint is not connected" error in general or only some other specific mount hang, but a possible solution is to add this line before the second mount attempt (i.e. after the first failure):
--- mount.glusterfs
+++ mount.glusterfs 2015-06-12 01:24:52.824311071 -0300
@@ -226,6 +226,7 @@
if [ $inode -ne 1 ]; then
err=1;
if [ -n "$cmd_line1" ]; then
+ umount $mount_point > /dev/null 2>&1;
err=0;
After this "patch", the mount point using the backupvolfile-
--- mount.glusterfs
+++ mount.glusterfs 2015-06-12 01:28:07.610199716 -0300
@@ -226,6 +226,7 @@
if [ $inode -ne 1 ]; then
err=1;
if [ -n "$cmd_line1" ]; then
+ sleep 0.1;
err=0;
I tested the last patch across many reboots (more than 60), and the mount worked in all of them.
### Why not backport a fix from upstream
Since commit https:/
I'm attaching a better patch here (file glusterfs-3.4.2/xlators/mount/fuse/utils/mount.glusterfs.in).