Mirantis OpenStack

Pacemaker RA for RabbitMQ cannot detect when its app hangs and keeps it unresponsive

Bug #1423116 reported by Leontiy Istomin on 2015-02-18

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Fix Committed	Critical	Bogdan Dobrelya	Mirantis OpenStack 6.1
5.0.x	Invalid	Critical	Fuel Library (Deprecated)	Mirantis OpenStack 5.0-updates
5.1.x	Won't Fix	Critical	Fuel Library (Deprecated)	Mirantis OpenStack 5.1.1-updates
6.0.x	Fix Committed	Critical	Bogdan Dobrelya	Mirantis OpenStack 6.0-updates
6.1.x	Fix Committed	Critical	Bogdan Dobrelya	Mirantis OpenStack 6.1

Bug Description

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-07_20-50-01
build_number: '76'
feature_groups:
- mirantis
fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-07_20-50-01
      build_number: '76'
      feature_groups:
      - mirantis
      fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
      fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
      nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

Baremetal,Ubuntu, HA, Neutron-gre,Ceilometer,Ceph-all, Debug, 6.0.1_76
Controllers:3 Computes:96

I have changed /usr/bin/keystone-all file on each controller node. I have added one line:
147 config.setup_logging()
this change regard to https://bugs.launchpad.net/mos/+bug/1413341

Deployment has been passed successfully, but during full rally test primary controller node has been marked as offline. Also this node is unreachable via ssh.

[root@fuel ~]# ssh node-19
Warning: Permanently added 'node-19' (RSA) to the list of known hosts.
Write failed: Broken pipe

But at the moment I have one opened ssh session which gives the able to execute some commands.

here is output of top command:
http://paste.openstack.org/show/176687/

root@node-19:~# free -m
total used free shared buffers cached
Mem: 32142 31768 373 0 211 11263
-/+ buffers/cache: 20292 11849
Swap: 15624 12 15612

"rabbitmqctl cluster_status" and "rabbitmqctl list_queues" commands just hang on this node

from other controller node:
root@node-52:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-52' ...
[{nodes,[{disc,['rabbit@node-19','rabbit@node-52','rabbit@node-65']}]},
{running_nodes,['rabbit@node-19','rabbit@node-65','rabbit@node-52']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.

root@node-52:~# rabbitmqctl list_queues | grep -v 0$
Listing queues ...
dhcp_agent.node-19 96
notifications.error 415
reply_0c7bc35f0e114b119b959160645ca04a 1
...done.

root@node-19:~# dmesg | grep -i error
[ 9.798790] ACPI Error: [\_SB_.PRAD]
[ 10.883460] ACPI Error: Method parse/execution failed [\_GPE._L24] (Node ffff880853d9d3e8), AE_NOT_FOUND (20131115/psparse-536)
[ 16.284591] ioapic: probe of 0000:00:05.4 failed with error -22
[ 17.779631] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 31.029678] EXT4-fs (sda3): re-mounted. Opts: errors=remount-ro

crm status output is here
http://paste.openstack.org/show/176723/

The last line in rabbitmq log is:
=INFO REPORT==== 18-Feb-2015::10:11:18 ===
accepting AMQP connection <0.9669.490> (192.168.0.54:41674 -> 192.168.0.21:5673)

snapshot will be here asap

See original description

Tags:

Bogdan Dobrelya (bogdando) on 2015-02-18

Changed in mos:
milestone:	none → 6.0.2
milestone:	6.0.2 → 6.0.1
importance:	Undecided → Critical
status:	New → In Progress

Leontiy Istomin (listomin) on 2015-02-18

description:

updated

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

node-19_ipmi.png Edit (17.9 KiB, image/png)

I can't reach bash even via IPMI.
It's very similar with https://bugs.launchpad.net/fuel/+bug/1422186

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2015-02-18:

The debug session for node-19 (Wed Feb 18 11:22:53 UTC 2015 - Wed Feb 18 13:05:43 UTC 2015) have shown the following:
1) beam process is able to process the AMQP connections (~900 existed within each 10 seconds frame):
- There number of sessions is floating
- There are AMQP messages flow in conversations

2) There are multiple {handshake_timeout,frame_header} and {handshake_timeout,frame_header}
in rabbitmq log started at 17-Feb-2015::14:49:04, Example: http://pastebin.com/tvCMLikr

3) rabbitmqctl status / list* is not responsive at all. The results of strace -s 2048 -T -tt -f -ff -o status.strace rabbitmqctl status are in attachment. And rabbitmqctl "eval" "rabbit_misc:which_applications()." reports the node is down: http://paste.openstack.org/show/176860/

4) pacemaker stopped to process monitor actions for rabbitmq resource at 2015-02-18T08:42:15.104256+00:00, here is the last two rounds http://paste.openstack.org/show/176843/ and there is none after that moment. But note, that pacemaker does not mark the resource as failed or not running, this is odd.

5) there is a "hanged" ocf action monitor which started at Feb 18 8:42 and still existed after 4h and all the time the debug session was in progress http://paste.openstack.org/show/176859/

Postmortem: It looks like the RA action monitor (eval" "rabbit_misc:which_applications()) invoked at 8:42 hanged and brought down the entire application.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2015-02-18:

status.strace.tgz Edit (29.1 KiB, application/x-tar)

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2015-02-18:

Master: https://review.openstack.org/#/c/157016

summary:

- primary controller has been marked as offline by fuel.
+ Pacemaker RA for RabbitMQ cannot detect when its app hangs and keeps it
+ unresponsive

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2015-02-18:

To test this patch:
* reproduce the isse (Note, the reproducing steps are unknown, please update the bug description then discovered),
* for affected node, manually update the OCF file for Rabbitmq RA
* for affected node, kill -9 the hanged monitor action and its forks , example http://paste.openstack.org/show/176859/
* it is expected that rabbit node will rejoin the cluster and the issue couldn't be reproduced anymore.

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

crm_status Edit (1.3 KiB, text/plain)

here is some outputs before rebooting

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

dmesg Edit (108.3 KiB, text/plain)

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

free -m Edit (230 bytes, text/plain)

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

lsof -u rabbit Edit (108.0 KiB, text/plain)

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-18:

#10

ps aux Edit (60.7 KiB, text/plain)

Revision history for this message

Leontiy Istomin (listomin) wrote on 2015-02-19:

#11

snapshot is here https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7hYnpqV0Zqc3IyT2s/view?usp=sharing

Revision history for this message

Vitaly Sedelnik (vsedelnik) wrote on 2015-05-27:

#12

Invalid for 6.0-updates as the change cannot be packaged

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2015-08-05:

#13

Why status is Fix Committed for the issue - we have no any comments here and any commits on review. need to update the status of the issue.

Bogdan, Leontiy, please, update the status of the issue.

Now status changed to Incomplete for MOS 6.0 and MOS 6.1 because we have only comments about bug reproduction but have no any comments with the fixes (we have one merged commit, but have no updates - does it fix the issue or not).

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2015-08-05:

#14

the patch https://review.openstack.org/#/c/157016 was merged and it closes this bug