OCFS2 intermittently not mountable on a second Node in Ubuntu 20.04.3 LTS
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ocfs2-tools (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Environment:
Oracle Cloud Infrastructure (OCI)
2 nodes + iSCSI block volume with OCFS2
Operating System: Ubuntu 20.04.3 LTS
Kernel: Linux 5.11.0-1027-oracle
ocfs2-tools: 1.8.6-2ubuntu1
Problem:
OCFS2 is intermittently not mountable on the second node.
Sometimes it works for hours and across many reboots.
Sometimes the mount hangs for hours, even after a reboot.
In the error case: if OCFS2 is unmounted from the first node, the mount on the second node succeeds.
But sometimes unmounting from the first node is not possible (it hangs, and sometimes the whole SSH session hangs with it).
The iSCSI volume is attached and visible.
O2CB runs without errors.
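For reference, the cluster and volume state can be double-checked with the standard ocfs2-tools commands (a sketch; the commands are generic and not specific to this setup):
# service o2cb status        # cluster stack online, heartbeat mode, registered cluster
# mounted.ocfs2 -d           # detect OCFS2 volumes and show their UUIDs
# mounted.ocfs2 -f           # list which cluster nodes currently have each volume mounted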
This does not happen with older Ubuntu/OCFS2 versions, e.g.:
Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-1061-oracle
ocfs2-tools: 1.8.5-3ubuntu1
It also does not happen with the current Oracle Linux version:
Operating System: Oracle Linux Server 7.9
Kernel: Linux 5.4.17-
ocfs2-tools.x86_64: 1.8.6-14.el7
Some error and log messages:
Boot screen without error:
# console (good, mounted after reboot)
[ OK ] Started Disk Manager.
[ OK ] Finished Load o2cb Modules.
Starting Mount ocfs2 Filesystems...
[ OK ] Found device BlockVolume ubuntu20-ocfs.
Mounting /ocfs2volume...
Boot screen with error:
[ OK ] Found device BlockVolume ubuntu20-ocfs.
Mounting /ocfs2volume...
[FAILED] Failed to mount /ocfs2volume.
See 'systemctl status ocfs2volume.mount' for details.
[DEPEND] Dependency failed for Remote File Systems.
...
node2 login: [ 242.574617] INFO: task mount.ocfs2:1000 blocked for more than 120 seconds.
[ 242.576764] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 242.578017] "echo 0 > /proc/sys/
[ 363.405225] INFO: task mount.ocfs2:1000 blocked for more than 241 seconds.
[ 363.407645] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 363.410559] "echo 0 > /proc/sys/
[ 484.230345] INFO: task mount.ocfs2:1000 blocked for more than 362 seconds.
[ 484.232416] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 484.233670] "echo 0 > /proc/sys/
# systemctl status ocfs2volume.mount
● ocfs2volume.mount - /ocfs2volume
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: timeout) since Wed 2022-01-19 13:02:04 CET; 8min ago
Where: /ocfs2volume
What: /dev/disk/
Docs: man:fstab(5)
Tasks: 1 (limit: 19138)
Memory: 756.0K
CGroup: /system.
└─1000 /sbin/mount.ocfs2 /dev/sdb /ocfs2volume -o rw,acl,_netdev
Jan 19 13:00:34 node2 systemd[1]: Mounting /ocfs2volume...
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Mounting timed out. Terminating.
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Mount process exited, code=killed, status=15/TERM
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Failed with result 'timeout'.
Jan 19 13:02:04 node2 systemd[1]: Failed to mount /ocfs2volume.
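Note: the mount unit is generated from /etc/fstab, and systemd kills mount.ocfs2 after its default 90-second start timeout while the process is still blocked. For debugging, the timeout can be raised per mount via a systemd fstab option, roughly like this (a sketch; device path and UUID abbreviated, and this only buys time, it does not address the underlying hang):
/dev/disk/by-uuid/7da60327-...  /ocfs2volume  ocfs2  rw,acl,_netdev,x-systemd.mount-timeout=10min  0  0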
# systemctl status ocfs2
● ocfs2.service - Mount ocfs2 Filesystems
Loaded: loaded (/lib/systemd/
Active: active (exited) since Wed 2022-01-19 13:00:31 CET; 11min ago
Docs: man:ocfs2(7)
Process: 934 ExecStart=
Main PID: 934 (code=exited, status=0/SUCCESS)
Jan 19 13:00:31 node2 systemd[1]: Starting Mount ocfs2 Filesystems...
Jan 19 13:00:31 node2 ocfs2[934]: Starting Oracle Cluster File System (OCFS2)
Jan 19 13:00:31 node2 ocfs2[939]: mount: /ocfs2volume: can't find UUID=7da60327-
Jan 19 13:00:31 node2 systemd[1]: Finished Mount ocfs2 Filesystems.
Jan 19 13:00:31 node2 ocfs2[934]: Failed
# INFO: UUID=7da60327-
# But errors differ in different runs
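Once the iSCSI disk is actually attached, the UUID does resolve; this can be confirmed with blkid, roughly like this (output abridged, UUID shortened as above):
# blkid /dev/sdb
/dev/sdb: UUID="7da60327-..." TYPE="ocfs2"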
# iscsiadm -m session -P 3
*******
Attached SCSI devices:
*******
Host Number: 3 State: running
scsi3 Channel 00 Id 0 Lun: 0
scsi3 Channel 00 Id 0 Lun: 2
Attached scsi disk sdb State: running
# dmesg
[ 1.337891] scsi 2:0:0:1: Direct-Access ORACLE BlockVolume 1.0 PQ: 0 ANSI: 6
...
[ 7.886896] scsi host3: scsi_eh_3: sleeping
[ 7.897819] scsi host3: iSCSI Initiator over TCP/IP
[ 8.198102] ocfs2: Registered cluster interface o2cb
[ 8.244150] OCFS2 User DLM kernel interface loaded
[ 8.272364] o2hb: Heartbeat mode set to local
[ 8.355866] scsi 3:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[ 8.358478] scsi 3:0:0:0: Attached scsi generic sg1 type 12
[ 8.360633] scsi 3:0:0:2: Direct-Access ORACLE BlockVolume 1.0 PQ: 0 ANSI: 6
[ 8.361620] sd 3:0:0:2: Attached scsi generic sg2 type 0
[ 8.373937] sd 3:0:0:2: [sdb] 2147483648 512-byte logical blocks: (1.10 TB/1.00 TiB)
[ 8.373941] sd 3:0:0:2: [sdb] 4096-byte physical blocks
[ 8.374114] sd 3:0:0:2: [sdb] Write Protect is off
[ 8.374117] sd 3:0:0:2: [sdb] Mode Sense: 2b 00 10 08
[ 8.374552] sd 3:0:0:2: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 8.374925] sd 3:0:0:2: [sdb] Optimal transfer size 1048576 bytes
[ 8.436917] sd 3:0:0:2: [sdb] Attached SCSI disk
...
[ 14.820725] o2net: Connected to node node1 (num 0) at 192.168.14.50:7777
[ 18.832591] o2dlm: Joining domain 7DA60327FD564AB ( 0 1 ) 2 nodes
[ 242.574617] INFO: task mount.ocfs2:1000 blocked for more than 120 seconds.
[ 242.576764] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 242.578017] "echo 0 > /proc/sys/
[ 242.579678] task:mount.ocfs2 state:D stack: 0 pid: 1000 ppid: 1 flags:0x00004000
[ 242.579684] Call Trace:
[ 242.579689] __schedule+
[ 242.579696] schedule+0x4f/0xc0
[ 242.579697] schedule_
[ 242.579700] wait_for_
[ 242.579705] __ocfs2_
[ 242.579747] ocfs2_inode_
[ 242.579788] ? ocfs2_inode_
[ 242.579825] ocfs2_journal_
[ 242.579863] ? ocfs2_get_
[ 242.579906] ocfs2_check_
[ 242.579942] ? iput+0x7a/0x200
[ 242.579947] ocfs2_mount_
[ 242.579986] ocfs2_fill_
[ 242.580026] mount_bdev+
[ 242.580030] ? ocfs2_initializ
[ 242.580070] ocfs2_mount+
[ 242.580108] legacy_
[ 242.580113] vfs_get_
[ 242.580116] ? capable+0x19/0x20
[ 242.580120] path_mount+
[ 242.580124] do_mount+0x7c/0xa0
[ 242.580126] __x64_sys_
[ 242.580129] do_syscall_
[ 242.580133] entry_SYSCALL_
[ 242.580136] RIP: 0033:0x7f5d2b9ecdde
[ 242.580138] RSP: 002b:00007ffe17
[ 242.580141] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5d2b9ecdde
[ 242.580143] RDX: 000055fb738f40ae RSI: 000055fb74188340 RDI: 000055fb7418f430
[ 242.580145] RBP: 00007ffe1754ace0 R08: 000055fb7418f410 R09: 00007ffe17548570
[ 242.580146] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe1754abd0
[ 242.580147] R13: 00007ffe1754ab50 R14: 000055fb7418f410 R15: 0000000000000000
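If fuller traces of every blocked task are needed while the mount hangs, they can be captured on demand with the sysrq "w" trigger (a sketch; assumes sysrq is enabled via the kernel.sysrq sysctl):
# echo w > /proc/sysrq-trigger    # dump all tasks in uninterruptible (D) state
# dmesg | tail -n 200             # collect the resulting stack traces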
# At shutdown:
[*** ] A stop job is running for Login to …lt iSCSI targets (2min 47s / 3min)
[ *] A stop job is running for Login to …lt iSCSI targets (5min 24s / 6min)
Additional information:
The bug occurred on two different, independent Ubuntu 20.04 installations in OCI.
It was also reported to Oracle, but Oracle declined to process it as an Oracle bug.
Thanks for the bug report, Norbert.
First of all, I noticed that the ocfs2-tools package contains a delta relative to the Debian package. The delta is trivial, though, and is there just to make sure the package is built only on supported architectures (see bug #1745155 for more details). I don't think it is what's causing this bug.
Debian itself carries a few distro-specific patches. Upon inspecting them, I don't think they're directly related to this issue either. So we're left with two possibilities here:
1) This is an upstream bug (which may have been fixed meanwhile), or
2) This is a bug elsewhere (the Linux kernel, maybe?). A quick way to probe this possibility is sketched right below.
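For instance, one could try to reproduce on a different kernel series on one of the nodes, e.g. the HWE kernel (a sketch; the package name assumes the standard Ubuntu archive rather than the -oracle flavour, and is only meant to vary the kernel, not as a recommendation):
# apt install linux-image-generic-hwe-20.04
# reboot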
I did some research on the internet and found some interesting things.
First, I found that it's really hard to obtain the list of patches that have been applied to a package on Oracle [GNU/]Linux! I knew that theoretically already, but now I have seen it for real. Anyway, I was trying to take a look at their package to see if anything strange/interesting came up. No luck.
Then, I decided to search for some terms from your bug description. It caught my attention that you were able to provide a stack trace (thanks, by the way!), so I tried searching for one of the functions listed there. And I did find something interesting:
https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2375753_1.html
Both the description of the problem (high I/O wait) and the stack trace provided in the link above match what you're seeing. I don't have an Oracle account so I can't see the rest of the article, but I'm assuming you do, so could you please take a look and see if there is anything useful for you there?
Meanwhile, I'm going to mark this bug as Incomplete (which is the closest we have to a "Waiting" state). When you have more information, please mark it as New again.
Thanks.