The debug session for node-19 (Wed Feb 18 11:22:53 UTC 2015 - Wed Feb 18 13:05:43 UTC 2015) have shown the following:
1) beam process is able to process the AMQP connections (~900 existed within each 10 seconds frame):
- There number of sessions is floating
- There are AMQP messages flow in conversations
2) There are multiple {handshake_timeout,frame_header} and {handshake_timeout,frame_header}
in rabbitmq log started at 17-Feb-2015::14:49:04, Example: http://pastebin.com/tvCMLikr
3) rabbitmqctl status / list* is not responsive at all. The results of strace -s 2048 -T -tt -f -ff -o status.strace rabbitmqctl status are in attachment. And rabbitmqctl "eval" "rabbit_misc:which_applications()." reports the node is down: http://paste.openstack.org/show/176860/
4) pacemaker stopped to process monitor actions for rabbitmq resource at 2015-02-18T08:42:15.104256+00:00, here is the last two rounds http://paste.openstack.org/show/176843/ and there is none after that moment. But note, that pacemaker does not mark the resource as failed or not running, this is odd.
5) there is a "hanged" ocf action monitor which started at Feb 18 8:42 and still existed after 4h and all the time the debug session was in progress http://paste.openstack.org/show/176859/
Postmortem: It looks like the RA action monitor (eval" "rabbit_misc:which_applications()) invoked at 8:42 hanged and brought down the entire application.
The debug session for node-19 (Wed Feb 18 11:22:53 UTC 2015 - Wed Feb 18 13:05:43 UTC 2015) have shown the following:
1) beam process is able to process the AMQP connections (~900 existed within each 10 seconds frame):
- There number of sessions is floating
- There are AMQP messages flow in conversations
2) There are multiple {handshake_ timeout, frame_header} and {handshake_ timeout, frame_header} 2015::14: 49:04, Example: http:// pastebin. com/tvCMLikr
in rabbitmq log started at 17-Feb-
3) rabbitmqctl status / list* is not responsive at all. The results of strace -s 2048 -T -tt -f -ff -o status.strace rabbitmqctl status are in attachment. And rabbitmqctl "eval" "rabbit_ misc:which_ applications( )." reports the node is down: http:// paste.openstack .org/show/ 176860/
4) pacemaker stopped to process monitor actions for rabbitmq resource at 2015-02- 18T08:42: 15.104256+ 00:00, here is the last two rounds http:// paste.openstack .org/show/ 176843/ and there is none after that moment. But note, that pacemaker does not mark the resource as failed or not running, this is odd.
5) there is a "hanged" ocf action monitor which started at Feb 18 8:42 and still existed after 4h and all the time the debug session was in progress http:// paste.openstack .org/show/ 176859/
Postmortem: It looks like the RA action monitor (eval" "rabbit_ misc:which_ applications( )) invoked at 8:42 hanged and brought down the entire application.