Okay, I took almost everything out of the equation:
- removed networking interfaces from systemd
- removed open-iscsi login/logout logic from systemd
- used a single network interface for the logins, on a single iscsi portal
I was still able to reproduce the issue by:
- doing a simple login after configuring the network device:
./net-start.sh ; sleep 1 ; sudo iscsiadm -m node --login
- shutting down the network device before any logout:
./net-stop.sh ; sudo shutdown -h now
There was no systemd service ordering in play (between open-iscsi, iscsid and the networking / ifupdown scripts) and I was able to cause the same issue 100% of the time, with the same messages and the same symptoms. This tells us that taking the network down while iSCSI sessions are still logged in is definitely what causes the hangs (check #3 for why).
Summary of Causes
1)
network shuts down -> iscsi luns are logged out (attempted) -> iscsid daemon is shut down
or
network shuts down -> iscsid daemon is shut down -> iscsi luns are logged out
- logout would fail due to lack of network
- iscsi daemon would die and/or logout wouldn't work
- iscsi sessions would still be there
- system would hang (transport layer can't logout)
2)
iscsid daemon is shut down -> iscsi luns are logged out -> network shuts down
- this would cause the bug I'm describing: the daemon dies and the logout doesn't work.
- some iscsi sessions would still be there
- system would hang (transport layer can't logout)
- (to check: open-iscsi.service ExecStop= running in parallel ?!?)
3)
I used a KVM virtual watchdog device + an NMI from the host to crash the guest during the hang:
http://pastebin.ubuntu.com/25394744/
And, finally, the hang happens because the kernel is stuck in its own shutdown logic.
> 0 0 0 ffffffff81e11500 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff8801a6a20e00 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff8801a6a21c00 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff8801a6a22a00 RU 0.0 0 0 [swapper/3]
ALL CPUs were idling during the hang.
crash> runq
CPU 0 RUNQUEUE: ffff8801ae016e00
CURRENT: PID: 0 TASK: ffffffff81e11500 COMMAND: "swapper/0"
RT PRIO_ARRAY: ffff8801ae016fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae016e98
[no tasks queued]
CPU 1 RUNQUEUE: ffff8801ae096e00
CURRENT: PID: 0 TASK: ffff8801a6a20e00 COMMAND: "swapper/1"
RT PRIO_ARRAY: ffff8801ae096fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae096e98
[no tasks queued]
CPU 2 RUNQUEUE: ffff8801ae116e00
CURRENT: PID: 0 TASK: ffff8801a6a21c00 COMMAND: "swapper/2"
RT PRIO_ARRAY: ffff8801ae116fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae116e98
[no tasks queued]
CPU 3 RUNQUEUE: ffff8801ae196e00
CURRENT: PID: 0 TASK: ffff8801a6a22a00 COMMAND: "swapper/3"
RT PRIO_ARRAY: ffff8801ae196fb0
[no tasks queued]
CFS RB_ROOT: ffff8801ae196e98
[no tasks queued]
NO task was scheduled to run.
crash> ps -u
PID PPID CPU TASK ST %MEM VSZ RSS COMM
1 0 1 ffff8801a69b8000 UN 0.0 15484 3204 systemd-shutdow
There was just ONE SINGLE user task running (systemd-shutdown)
crash> set ffff8801a69b8000
PID: 1
COMMAND: "systemd-shutdow"
TASK: ffff8801a69b8000 [THREAD_INFO: ffff8801a69c0000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 1 TASK: ffff8801a69b8000 CPU: 1 COMMAND: "systemd-shutdow"
#0 [ffff8801a69c3a30] __schedule at ffffffff8183e9ee
#1 [ffff8801a69c3a80] schedule at ffffffff8183f0d5
#2 [ffff8801a69c3a98] schedule_timeout at ffffffff81842199
#3 [ffff8801a69c3b40] io_schedule_timeout at ffffffff8183e604
#4 [ffff8801a69c3b70] wait_for_completion_io_timeout at ffffffff8183fc6c
#5 [ffff8801a69c3bd0] blk_execute_rq at ffffffff813cfe10
#6 [ffff8801a69c3c88] scsi_execute at ffffffff815c3fc7
#7 [ffff8801a69c3cc8] scsi_execute_req_flags at ffffffff815c60fe
#8 [ffff8801a69c3d30] sd_sync_cache at ffffffff815d37d7
#9 [ffff8801a69c3da8] sd_shutdown at ffffffff815d3c3c
#10 [ffff8801a69c3dc0] device_shutdown at ffffffff8156112c
#11 [ffff8801a69c3df0] kernel_power_off at ffffffff810a32f5
#12 [ffff8801a69c3e00] SYSC_reboot at ffffffff810a34d3
#13 [ffff8801a69c3f40] sys_reboot at ffffffff810a359e
#14 [ffff8801a69c3f50] entry_SYSCALL_64_fastpath at ffffffff818431f2
RIP: 00007face7d8f856 RSP: 00007ffd4c271d08 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 00007ffd4c271240 RCX: 00007face7d8f856
RDX: 000000004321fedc RSI: 0000000028121969 RDI: fffffffffee1dead
RBP: 00007ffd4c271230 R8: 0000000000001400 R9: 0000000000000005
R10: 00007face88d36c0 R11: 0000000000000202 R12: 00007ffd4c2713d0
R13: 000000b7e6795c83 R14: 00007ffd4c271c30 R15: 0000000000000001
ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
And it called the "kernel_power_off" logic from the reboot system call. That logic (device_shutdown() in the backtrace above) iterates over all devices, calling:
if (dev->bus && dev->bus->shutdown) {
        dev->bus->shutdown(dev);
} else if (dev->driver && dev->driver->shutdown) {
        dev->driver->shutdown(dev);
}
This called sd_shutdown(), which called sd_sync_cache(), which sent a SYNCHRONIZE CACHE SCSI opcode to the device in question (one of the iSCSI paths). scsi_execute() called blk_execute_rq() to put the request on the device's queue for execution.
/* Prevent hang_check timer from firing at us during very long I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (hang_check)
        while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
The timeout for each wait is hang_check (sysctl_hung_task_timeout_secs, 120 seconds by default) * (HZ/2) = 120 * 125 = 15000 jiffies, i.e. 60 seconds with HZ=250, AND wait_for_completion_io_timeout() returns 0 when it times out; !0 == 1, so this loop will run forever.
So basically the kernel is trying to complete the SYNCHRONIZE CACHE operation: the SCSI command is already sitting in the device's queue and no completion ever comes back (for obvious reasons, since the transport is gone).
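To make that loop behaviour explicit, here is a tiny userspace sketch (NOT kernel code; fake_wait_for_completion_io_timeout and the capped iteration count are stand-ins I made up for illustration) of why the while (!...) above never terminates once the iSCSI transport is gone:

#include <stdio.h>

/* stand-in for wait_for_completion_io_timeout(): returns 0 on timeout,
   non-zero (time left) if the completion fired in time */
static unsigned long fake_wait_for_completion_io_timeout(int completion_fired,
                                                         unsigned long timeout)
{
        return completion_fired ? timeout : 0;
}

int main(void)
{
        const unsigned long HZ = 250;
        const unsigned long hang_check = 120;                /* default hung-task timeout, seconds */
        const unsigned long timeout = hang_check * (HZ / 2); /* 120 * 125 = 15000 jiffies = 60s */
        const int completion_fired = 0;  /* transport gone: the request never completes */
        int iterations = 0;

        printf("per-iteration timeout: %lu jiffies (%lu s)\n", timeout, timeout / HZ);

        /* mirrors: while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2))); */
        while (!fake_wait_for_completion_io_timeout(completion_fired, timeout)) {
                if (++iterations == 3) {
                        printf("timed out %d times; the real loop just re-arms forever\n",
                               iterations);
                        break;  /* capped here so the demo terminates */
                }
        }
        return 0;
}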
Obs 1)
I have tested this with kernel 4.11 with the same symptom. What I have to check now: is there any error-propagation logic that could cancel this so the system can shut down? We are not getting a SCSI sense message because the "bus" (iSCSI over TCP) is gone. Should we hard-flush the device's queue?
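To illustrate the direction I mean, here is a rough sketch of the idea (NOT actual kernel code: sd_shutdown_idea is a made-up name, and whether the scsi_device is really marked offline in this scenario is exactly what I still need to confirm in the dump). The shutdown path could refuse to queue the cache sync when the transport is already known dead, failing fast instead of waiting forever in blk_execute_rq():

/* sketch only, in the style of drivers/scsi/sd.c */
static void sd_shutdown_idea(struct device *dev)
{
        struct scsi_disk *sdkp = dev_get_drvdata(dev);

        if (!sdkp)
                return;

        /* hypothetical guard: skip SYNCHRONIZE CACHE if the device or its
           transport is already gone, so shutdown can proceed */
        if (sdkp->WCE && scsi_device_online(sdkp->device))
                sd_sync_cache(sdkp);
}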
Obs 2)
This isn't necessarily related to systemd. IMO this should also be addressed in the kernel, so that a broken transport layer - especially TCP/IP - still allows the shutdown to complete.
Obs 3)
The cause of this can be related to systemd (if there are leftovers) or to the iscsid daemon (if it died and/or got restarted before logging out). Is there a way of keeping a history of sessions so it knows what to log out after the daemon starts again?
I'll get back to this next week, after reading more of this dump.