Brief Description
-----------------
The ceph-osd process was killed on an AIO-SX system. The target did not recover afterwards, and kernel hang (hung-task) messages were observed on the console.
Severity
--------
Major
Steps to Reproduce
------------------
1. Run ps -ef | grep ceph-osd and note the PID of the ceph-osd process.
2. Issue kill -9 <pid> (see the command sketch after this list).
3. Observe the console output; hung kernel task messages appear.
4. Attempt to recover the system; in this case recovery was only possible by rebooting the target.
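A minimal shell sketch of the reproduction, assuming it is run as root on controller-0 and that kern.log is at /var/log/kern.log (the pgrep form is only a convenience; the original test used ps and kill directly):

    # Find the oldest ceph-osd PID and kill it with SIGKILL
    OSD_PID=$(pgrep -o ceph-osd)
    echo "Killing ceph-osd pid ${OSD_PID}"
    kill -9 "${OSD_PID}"

    # Watch the console / kern.log for hung-task messages
    tail -f /var/log/kern.log | grep -i "blocked for more than"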
Expected Behavior
------------------
The ceph-osd process is killed and is then automatically restarted; the system remains responsive.
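A sketch of how the expected recovery could be verified, assuming the standard Ceph CLI is available on the controller:

    # A new ceph-osd process should appear with a different PID
    ps -ef | grep [c]eph-osd

    # The cluster should report the OSD as up/in again
    ceph osd tree
    ceph -s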
Actual Behavior
----------------
The process is killed, and then kernel hung-task messages are observed on the console. The command line is no longer responsive, and the system must be rebooted to recover.
Reproducibility
---------------
Intermittent. Several OSD process kills were performed and the issue was seen only once: 1 out of 4 attempts so far.
System Configuration
--------------------
One-node system (AIO-SX)
Branch/Pull Time/Commit
-----------------------
Master as of 20190505T233000Z
Last Pass
---------
First time testing this scenario on StarlingX with Ceph Mimic.
Timestamp/Logs
--------------
From kern.log:
2019-05-06T18:30:05.512 controller-0 kernel: warning [ 4538.906310] libceph: osd0 192.168.204.3:6800 socket error on write
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.830603] INFO: task jbd2/rbd0-8:126770 blocked for more than 120 seconds.
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.838488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-05-06T18:30:29.450 controller-0 kernel: info [ 4562.847368] jbd2/rbd0-8 D ffff97948d0222e0 0 126770 2 0x00000000
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847372] Call Trace:
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847381] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847385] [<ffffffffb3812709>] schedule+0x29/0x70
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847391] [<ffffffffb380f4f9>] schedule_timeout+0x289/0x330
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847397] [<ffffffffb3316e49>] ? __blk_mq_run_hw_queue+0x69/0xa0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847404] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847408] [<ffffffffb312ff9f>] ? delayacct_end+0x8f/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847411] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847413] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847415] [<ffffffffb381121d>] io_schedule_timeout+0xad/0x130
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847416] [<ffffffffb38112b8>] io_schedule+0x18/0x20
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847418] [<ffffffffb380fb41>] bit_wait_io+0x11/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847420] [<ffffffffb380f667>] __wait_on_bit+0x67/0x90
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847423] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847427] [<ffffffffb380f7d1>] out_of_line_wait_on_bit+0x81/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847434] [<ffffffffb30a7060>] ? wake_bit_function+0x40/0x40
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847445] [<ffffffffb323979a>] __wait_on_buffer+0x2a/0x30
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847474] [<ffffffffc06d58a6>] jbd2_journal_commit_transaction+0xfd6/0x1c90 [jbd2]
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847479] [<ffffffffb30b355e>] ? finish_task_switch+0x4e/0x1b0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847488] [<ffffffffc06dba9d>] kjournald2+0xcd/0x260 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847491] [<ffffffffb30a6fa0>] ? wake_up_atomic_t+0x30/0x30
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847498] [<ffffffffc06db9d0>] ? commit_timeout+0x10/0x10 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847501] [<ffffffffb30a5ee1>] kthread+0xd1/0xe0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847505] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847509] [<ffffffffb3816e1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847513] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.847516] INFO: task jbd2/rbd1-8:126832 blocked for more than 120 seconds.
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.855438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
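The "blocked for more than 120 seconds" threshold seen above is the kernel hung-task timeout; a quick way to inspect or silence it on the target (sketch only; 120 seconds is the usual default):

    # Show the current hung-task timeout in seconds
    cat /proc/sys/kernel/hung_task_timeout_secs

    # As the log message itself notes, writing 0 disables the warning
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs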
Test Activity
-------------
Feature
Note: the same behavior was also observed on a bare metal (BM) 2+2+2 system.