Okay, I took almost everything out of the equation:
- removed networking interfaces from systemd
- removed open-iscsi login/logout logic from systemd
- used a single network interface for the logins, on a single iscsi portal
I was still able to reproduce the issue by:
- doing a simple login after configuring the network device:
./net-start.sh ; sleep 1 ; sudo iscsiadm -m node --login
- shutting down the network device before any logout:
./net-stop.sh ; sudo shutdown -h now
There was no systemd service ordering in play (between open-iscsi, iscsid and the networking / ifupdown scripts) and I was able to cause the same issue 100% of the time, with the same messages and the same symptoms. This tells us that taking the network down while iSCSI sessions are still logged in is definitely what causes the hangs (check #3 for why).
Summary of Causes
1)
network shuts down -> iscsi luns are logged out (attempted) -> iscsid daemon is shut down
or
network shuts down -> iscsid daemon is shut down -> iscsi luns are logged out
- logout would fail due to lack of network
- iscsi daemon would die and/or logout wouldn't work
- iscsi sessions would still be there
- system would hang (transport layer can't logout)
2)
iscsid daemon is shut down -> iscsi luns are logged out -> network shuts down
- this would cause the bug I'm describing: the daemon dies and the logout doesn't work.
- some iscsi sessions would still be there
- system would hang (transport layer can't logout)
- (to check: open-iscsi.service ExecStop= running in parallel ?!?)
3)
I used a KVM virtual watchdog device + an NMI from the host to crash the guest during the hang:
http://pastebin.ubuntu.com/25394744/
And, finally, the hang happens because the kernel is stuck in its own shutdown logic.
> 0 0 0 ffffffff81e11500 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff8801a6a20e00 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff8801a6a21c00 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff8801a6a22a00 RU 0.0 0 0 [swapper/3]
ALL CPUs were idling during the hang.
crash> runq
CPU 0 RUNQUEUE: ffff8801ae016e00
CURRENT: PID: 0 TASK: ffffffff81e11500 COMMAND: "swapper/0"
RT PRIO_ARRAY: ffff8801ae016fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae016e98
[no tasks queued]
CPU 1 RUNQUEUE: ffff8801ae096e00
CURRENT: PID: 0 TASK: ffff8801a6a20e00 COMMAND: "swapper/1"
RT PRIO_ARRAY: ffff8801ae096fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae096e98
[no tasks queued]
CPU 2 RUNQUEUE: ffff8801ae116e00
CURRENT: PID: 0 TASK: ffff8801a6a21c00 COMMAND: "swapper/2"
RT PRIO_ARRAY: ffff8801ae116fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae116e98
[no tasks queued]
CPU 3 RUNQUEUE: ffff8801ae196e00
CURRENT: PID: 0 TASK: ffff8801a6a22a00 COMMAND: "swapper/3"
RT PRIO_ARRAY: ffff8801ae196fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae196e98
[no tasks queued]
NO task was scheduled to run.
crash> ps -u
PID PPID CPU TASK ST %MEM VSZ RSS COMM
1 0 1 ffff8801a69b8000 UN 0.0 15484 3204 systemd-shutdow
There was just ONE SINGLE user task running (systemd-shutdown)
crash> set ffff8801a69b8000
PID: 1
COMMAND: "systemd-shutdow"
TASK: ffff8801a69b8000 [THREAD_INFO: ffff8801a69c0000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 1 TASK: ffff8801a69b8000 CPU: 1 COMMAND: "systemd-shutdow"
#0 [ffff8801a69c3a30] __schedule at ffffffff8183e9ee
#1 [ffff8801a69c3a80] schedule at ffffffff8183f0d5
#2 [ffff8801a69c3a98] schedule_timeout at ffffffff81842199
#3 [ffff8801a69c3b40] io_schedule_timeout at ffffffff8183e604
#4 [ffff8801a69c3b70] wait_for_completion_io_timeout at ffffffff8183fc6c
#5 [ffff8801a69c3bd0] blk_execute_rq at ffffffff813cfe10
#6 [ffff8801a69c3c88] scsi_execute at ffffffff815c3fc7
#7 [ffff8801a69c3cc8] scsi_execute_req_flags at ffffffff815c60fe
#8 [ffff8801a69c3d30] sd_sync_cache at ffffffff815d37d7
#9 [ffff8801a69c3da8] sd_shutdown at ffffffff815d3c3c
#10 [ffff8801a69c3dc0] device_shutdown at ffffffff8156112c
#11 [ffff8801a69c3df0] kernel_power_off at ffffffff810a32f5
#12 [ffff8801a69c3e00] SYSC_reboot at ffffffff810a34d3
#13 [ffff8801a69c3f40] sys_reboot at ffffffff810a359e
#14 [ffff8801a69c3f50] entry_SYSCALL_64_fastpath at ffffffff818431f2
RIP: 00007face7d8f856 RSP: 00007ffd4c271d08 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 00007ffd4c271240 RCX: 00007face7d8f856
RDX: 000000004321fedc RSI: 0000000028121969 RDI: fffffffffee1dead
RBP: 00007ffd4c271230 R8: 0000000000001400 R9: 0000000000000005
R10: 00007face88d36c0 R11: 0000000000000202 R12: 00007ffd4c2713d0
R13: 000000b7e6795c83 R14: 00007ffd4c271c30 R15: 0000000000000001
ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
And it called the "kernel_power_off" logic from the reboot system call. That logic (device_shutdown() in the backtrace above) iterates over all devices, calling:
if (dev->bus && dev->bus->shutdown) {
        dev->bus->shutdown(dev);
} else if (dev->driver && dev->driver->shutdown) {
        dev->driver->shutdown(dev);
}
This called sd_shutdown(), which called sd_sync_cache(), which sent a SYNCHRONIZE CACHE SCSI opcode to the device in question (one of the iSCSI paths). scsi_execute() called blk_execute_rq() to put the request on the device's queue for execution.
/* Prevent hang_check timer from firing at us during very long I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (hang_check)
        while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
The timeout for each wait is hang_check (sysctl_hung_task_timeout_secs, 120 seconds by default) * (HZ/2) = 120 * 125 = 15000 jiffies, i.e. 60 seconds with HZ=250, AND wait_for_completion_io_timeout() returns 0 when it times out; !0 == 1, so this loop will run forever.
So basically the kernel is trying to complete the SYNCHRONIZE CACHE operation: the SCSI command is already sitting in the device's queue and no completion ever comes back (for obvious reasons, since the transport is gone).
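To make that loop behaviour explicit, here is a tiny userspace sketch (NOT kernel code; fake_wait_for_completion_io_timeout and the capped iteration count are stand-ins I made up for illustration) of why the while (!...) above never terminates once the iSCSI transport is gone:

#include <stdio.h>

/* stand-in for wait_for_completion_io_timeout(): returns 0 on timeout,
   non-zero (time left) if the completion fired in time */
static unsigned long fake_wait_for_completion_io_timeout(int completion_fired,
                                                         unsigned long timeout)
{
        return completion_fired ? timeout : 0;
}

int main(void)
{
        const unsigned long HZ = 250;
        const unsigned long hang_check = 120;                /* default hung-task timeout, seconds */
        const unsigned long timeout = hang_check * (HZ / 2); /* 120 * 125 = 15000 jiffies = 60s */
        const int completion_fired = 0;  /* transport gone: the request never completes */
        int iterations = 0;

        printf("per-iteration timeout: %lu jiffies (%lu s)\n", timeout, timeout / HZ);

        /* mirrors: while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2))); */
        while (!fake_wait_for_completion_io_timeout(completion_fired, timeout)) {
                if (++iterations == 3) {
                        printf("timed out %d times; the real loop just re-arms forever\n",
                               iterations);
                        break;  /* capped here so the demo terminates */
                }
        }
        return 0;
}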
Obs 1)
I have tested this with kernel 4.11 with the same symptom. What I have to check now: is there any error-propagation logic that could cancel this so the system can shut down? We are not getting a SCSI sense message because the "bus" (iSCSI over TCP) is gone. Should we hard-flush the device's queue?
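To illustrate the direction I mean, here is a rough sketch of the idea (NOT actual kernel code: sd_shutdown_idea is a made-up name, and whether the scsi_device is really marked offline in this scenario is exactly what I still need to confirm in the dump). The shutdown path could refuse to queue the cache sync when the transport is already known dead, failing fast instead of waiting forever in blk_execute_rq():

/* sketch only, in the style of drivers/scsi/sd.c */
static void sd_shutdown_idea(struct device *dev)
{
        struct scsi_disk *sdkp = dev_get_drvdata(dev);

        if (!sdkp)
                return;

        /* hypothetical guard: skip SYNCHRONIZE CACHE if the device or its
           transport is already gone, so shutdown can proceed */
        if (sdkp->WCE && scsi_device_online(sdkp->device))
                sd_sync_cache(sdkp);
}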
Obs 2)
This isn't necessarily related to systemd. IMO this should also be addressed in the kernel, so that a broken transport layer - especially TCP/IP - still allows the shutdown to complete.
Obs 3)
The cause of this can be related to systemd (if there are leftovers) or to the iscsid daemon (if it died and/or got restarted before logging out). Is there a way of keeping a history of sessions so it knows what to log out after the daemon starts again?
I'll get back to this next week, after reading more of this dump.