NFS mount options seem to trigger failure and dead mount
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad Mojo Specs | Fix Released | High | Colin Watson |
Bug Description
We have been experiencing a number of NFS mount timeouts on the turnip machines.
Upon investigation, a common error appears in the ganesha logs on the NFS host:
rpc :TIRPC :EVENT :svc_ioq_flushv() writev failed (11)
The 11 indicates EAGAIN, which suggests that the server should be retrying the send
to the client. The client, however, appears to no longer be listening.
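For reference, errno 11 on Linux is indeed EAGAIN; a quick check (Python used purely for illustration, not part of the ganesha stack):

```python
import errno
import os

# "writev failed (11)" in the ganesha log: decode errno 11 on Linux.
code = 11
assert code == errno.EAGAIN  # EAGAIN: resource temporarily unavailable
print(os.strerror(code))     # prints the human-readable description
```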
I also found a report of similar behaviour in the nfs-ganesha server:
https://<email address hidden>
This suggests that the server is hitting an error and the client is treating the NFS mount as dead, which is backed up by what we see in the logs on the turnip machine. Under normal circumstances the client would keep retrying; we have disabled that behaviour with the 'soft' mount option and limited retries to two via the 'retrans' option.
In the case of the above thread, the original reporter removed the 'soft' option from the NFS mount and that solved the problem. If we cannot drop the 'soft' option, we should at least consider raising the 'retrans' mount option from 2 to something higher, such as 10 (or whatever other number seems appropriate).
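To make the options concrete, here is a hedged sketch of the fstab entries being discussed; the server name, export path, and mount point are placeholders, not the real turnip configuration:

```
# Current behaviour: soft mount, give up after 2 retransmissions
nfs-server:/export  /srv/data  nfs  soft,retrans=2   0 0

# Option A: keep 'soft' but retry longer before declaring the mount dead
nfs-server:/export  /srv/data  nfs  soft,retrans=10  0 0

# Option B: drop 'soft' entirely ('hard' is the default), so the client
# retries indefinitely instead of returning errors to applications
nfs-server:/export  /srv/data  nfs  hard             0 0
```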
Related branches
- Ioana Lasc (community): Approve
- Diff: 13 lines (+1/-1), 1 file modified: mojo-lp-git/services (+1/-1)
information type: Private → Public
affects: turnip → launchpad-mojo-specs
Changed in launchpad-mojo-specs:
assignee: nobody → Colin Watson (cjwatson)
status: New → In Progress
importance: Undecided → High
Changed in launchpad-mojo-specs:
status: In Progress → Fix Committed
Additional information.
This issue recurred after this bug was filed. The previous workaround had been to restart ganesha.nfsd, which did work; this time I instead attempted a remount from the client:
mount -o remount .....
This worked perfectly and restored a functioning mount, which suggests that the issue is indeed on the client end (or at least requires the client to reconnect in order to recover).
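Since a client-side remount recovered the mount, a check-and-remount script could automate this recovery. The following is an illustrative sketch only, not something we currently run; the mount point is a placeholder (defaulting to /tmp just so the healthy path is demonstrable), and the real path on the turnip machine would be substituted:

```shell
#!/bin/sh
# Illustrative sketch: probe an NFS mount and remount it if unresponsive.
# MOUNTPOINT is a placeholder; pass the real NFS mount point as $1.
MOUNTPOINT="${1:-/tmp}"

# stat hangs on a dead NFS mount, so bound the probe with a timeout.
if timeout 10 stat -t "$MOUNTPOINT" >/dev/null 2>&1; then
    echo "mount OK: $MOUNTPOINT"
else
    echo "mount unresponsive, remounting: $MOUNTPOINT"
    mount -o remount "$MOUNTPOINT"
fi
```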