watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux-azure (Ubuntu) | New | Undecided | Unassigned | | |
Bug Description
Hello Team,
I have a customer who hits this issue about once every 2 days. Here are the details of the bug:
May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu
May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] FS: 00007f154838a70
May 14 05:24:21 localhost kernel: [6006808.160055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0
May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
May 14 05:24:21 localhost kernel: [6006808.160055] ? new_sync_
May 14 05:24:21 localhost kernel: [6006808.160055] vfs_write+
May 14 05:24:21 localhost kernel: [6006808.160055] ? syscall_
May 14 05:24:21 localhost kernel: [6006808.160055] SyS_write+0x55/0xc0
May 14 05:24:21 localhost kernel: [6006808.160055] do_syscall_
May 14 05:24:21 localhost kernel: [6006808.160055] entry_SYSCALL64
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f1548
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8
May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740
May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0
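Since this recurs roughly every 2 days across several servers, it may help to count occurrences per host. Below is a minimal sketch, assuming the watchdog line always has the format shown in the log above; the function name `find_soft_lockups` and the returned field names are hypothetical, chosen only for illustration.

```python
import re

# Matches the watchdog line format seen in the log above, e.g.
# "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]".
# Assumption: other kernels may word this line differently.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events
```

It could be fed the lines of a saved kernel log file, for example `find_soft_lockups(open(path))`, where `path` is whatever log file the host writes kernel messages to.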
The customer is using Elasticsearch, so he filed an issue on the Elasticsearch GitHub project. The maintainers there say that this is a kernel issue and not an Elasticsearch issue:
Attached github post for reference :
https:/
For now, I have asked him to increase the kernel.
The customer wants to know for certain whether this is a kernel bug. I also asked him to perform a kernel update; he is willing to do so on all 65 servers, but only once it is confirmed that this is a bug in the current kernel.
The customer also filed a bug with the Java team, since the Java process seems to be causing the issue.
Their reply was that it is a kernel issue, and they gave the following Launchpad link, although I personally do not think it applies here. I may be wrong, however:
https:/
Performance figures for the Java process on the customer's machine:
Avg. Load: Avg=3, max=9
CPU: Avg=29, max=73
MEM: Avg=18, max=23
CGROUP:
ubuntu@
12:rdma:/
11:devices:
10:pids:
9:cpuset:/
8:blkio:
7:memory:
6:perf_event:/
5:cpu,cpuacct:
4:net_cls,
3:freezer:/
2:hugetlb:/
1:name=
/etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_
DISTRIB_
DISTRIB_
Kernel Version : 4.13.0-1011-azure #14-Ubuntu
Please let me know your thoughts given the above information. If more information is required, I will be happy to gather and provide it.
Regards,
Sriharsha B S,
Microsoft Azure Linux Team
Indeed, this problem should be fixed by the 4.13.0-1017 that's currently in -proposed on its way to -updates. The bug for the race condition should be https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1765564
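To check which of the 65 servers still run a kernel older than 4.13.0-1017, something like the following could be used. It is a rough sketch, assuming the release string has the `4.13.0-1011-azure` shape reported above and that comparing the ABI number is meaningful only within the linux-azure 4.13 series; `parse_release` and `has_fix` are hypothetical helper names.

```python
import re

# Version named in the comment above as carrying the fix.
FIXED = (4, 13, 0, 1017)


def parse_release(release):
    """Turn a string such as '4.13.0-1011-azure' into (4, 13, 0, 1011)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_fix(release, fixed=FIXED):
    """True if the release is at or past the fixed version.

    Assumption: only valid for comparing linux-azure 4.13 kernels with
    each other; other flavours number their ABI differently.
    """
    return parse_release(release) >= fixed
```

On a given host the release string can be obtained with `platform.release()` from the Python standard library, which returns the same value as `uname -r`.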