Rabbit terminated unexpectedly

Bug #1401948 reported by Ilya Shakhat
This bug affects 2 people
Affects                    Status    Importance  Assigned to     Milestone
Mirantis OpenStack         Invalid   High        Ilya Shakhat
Mirantis OpenStack 6.0.x   Invalid   High        Ilya Shakhat
Mirantis OpenStack 6.1.x   Invalid   High        krogon-intel

Bug Description

RabbitMQ terminated under load from Rally tests.

Log keywords: (full logs are attached)
node-98:
2014-12-12T06:23:15.110223+00:00 notice: ** Generic server <0.639.0> terminating
2014-12-12T06:23:15.110223+00:00 notice: ** Last message in was {mnesia_locker,'rabbit@node-35',granted}
2014-12-12T06:23:15.110223+00:00 notice: ** When Server state == undefined
2014-12-12T06:23:15.110223+00:00 notice: ** Reason for termination ==
2014-12-12T06:23:15.110223+00:00 notice: ** {unexpected_info,{mnesia_locker,'rabbit@node-35',granted}}
2014-12-12T06:23:25.613789+00:00 notice: =ERROR REPORT==== 12-Dec-2014::06:23:17 ===
2014-12-12T06:23:25.613789+00:00 notice: AMQP connection <0.15359.217> (running), channel 0 - error:
2014-12-12T06:23:25.613789+00:00 notice: {amqp_error,connection_forced,
2014-12-12T06:23:25.613789+00:00 notice: "broker forced connection closure with reason 'shutdown'",none}

2014-12-12T06:33:36.030346+00:00 notice: =INFO REPORT==== 12-Dec-2014::06:33:30 ===
2014-12-12T06:33:36.030515+00:00 notice: Error description:
2014-12-12T06:33:36.030515+00:00 notice: {error,{inconsistent_cluster,"Node 'rabbit@node-98' thinks it's clustered with node 'rabbit@node-36', but 'rabbit@node-36' disagrees"}}

node-36:
2014-12-12T06:23:13.311708+00:00 notice: ** Generic server <0.638.0> terminating
2014-12-12T06:23:13.311708+00:00 notice: ** Last message in was {mnesia_locker,'rabbit@node-35',granted}
2014-12-12T06:23:13.311708+00:00 notice: ** When Server state == undefined
2014-12-12T06:23:13.311708+00:00 notice: ** Reason for termination ==
2014-12-12T06:23:13.311708+00:00 notice: ** {unexpected_info,{mnesia_locker,'rabbit@node-35',granted}}

haproxy.log
2014-12-12T06:23:25.009115+00:00 alert: Server rabbitmq/node-35 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009236+00:00 alert: Backup Server rabbitmq/node-98 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009367+00:00 alert: Backup Server rabbitmq/node-36 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009367+00:00 emerg: proxy rabbitmq has no server available!

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.0"
  api: "1.0"
  build_number: "49"
  build_id: "2014-12-09_22-41-06"
  astute_sha: "16b252d93be6aaa73030b8100cf8c5ca6a970a91"
  fuellib_sha: "2c99931072d951301d395ebd5bf45c8d401301bb"
  ostf_sha: "a9afb68710d809570460c29d6c3293219d3624d4"
  nailgun_sha: "22bd43b89a17843f9199f92d61fc86cb0f8772f1"
  fuelmain_sha: "3aab16667f47dd8384904e27f70f7a87ba15f4ee"

Revision history for this message
Ilya Shakhat (shakhat) wrote :
Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Ilya Pekelny (i159) wrote :

Does this problem reproduce consistently under the same circumstances? Were all the other node services still working well after Rabbit went down?

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Switching to MOS Linux since the bug is about RabbitMQ's well-being.

Changed in mos:
assignee: MOS Oslo (mos-oslo) → MOS Linux (mos-linux)
milestone: none → 6.0
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

Ilya, could you please provide the following information:
1. How many nodes were involved in the test and what was the deployment scenario?
2. Do you see this bug constantly, or only under certain circumstances? If the latter, what are they?

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Probably 3 nodes were involved, as far as I can see from the logs.

Changed in mos:
assignee: MOS Linux (mos-linux) → Ilya Shakhat (shakhat)
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Answers:
1. The issue happened in the scale lab with 3 controllers and 97 computes. Deployment: HA + Ubuntu, without optional services (Ceilometer, Sahara, Murano).
2. The issue happened only once.

Since the issue most probably occurs rarely and randomly, it would be sufficient to get the following info:
1. What do all these errors in the RabbitMQ log mean? Is it possible to tell from them what went wrong?
2. What kind of info do we need to collect to simplify debugging in the future? (A possible checklist is sketched below.)

Take into account that the same may happen in a customer deployment, and the customer would certainly ask for the reasons.
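For instance, a minimal set of commands worth capturing right after such a crash could look like this (a sketch; the log directory assumes the default RabbitMQ packaging layout on the controllers):

# cluster membership and partition state as RabbitMQ sees it
rabbitmqctl cluster_status
# runtime state, including file descriptor and memory usage
rabbitmqctl status
# per-queue backlog, to see whether consumers stopped draining
rabbitmqctl list_queues name messages consumers
# broker and SASL logs from every controller
tar czf /tmp/rmq-logs-$(hostname).tar.gz /var/log/rabbitmq/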

Revision history for this message
Pavel Chekin (pchekin) wrote :

This issue may be related to https://bugs.launchpad.net/mos/+bug/1383120

I suppose the current file descriptor limit for RabbitMQ is 1024. Try decreasing it to reproduce the issue. If this is the case, then the issue can easily be solved by increasing the file descriptor limit.
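A rough sketch of checking and raising the limit (the /etc/default path applies to the Ubuntu packaging and is an assumption; on CentOS the equivalent file would be /etc/sysconfig/rabbitmq-server):

# current limit as the broker sees it: {file_descriptors,[{total_limit,...},...]} in the output
rabbitmqctl status | grep -A4 file_descriptors
# raise the limit for the service and restart the broker
echo 'ulimit -n 65536' >> /etc/default/rabbitmq-server
service rabbitmq-server restart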

tags: added: mos-linux
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Result of the investigation:

1) Generally, there is not enough information in these logs to understand what happened.
2) There is some information in rabbitmq-sasl.log which I don't know how to interpret (and I found no one who does).
3) This bug is not connected with https://bugs.launchpad.net/mos/+bug/1383120 (file descriptor limit alarms): there is no alarm message in the log, as there would be in that case (see the grep sketch at the end of this comment).

It generally looks like the system had temporary network trouble around the crash point:

1) At 6:23 a lot of connections were closed due to heartbeat_timeout, and then both node-98 and node-36 remained silent until 6:34.
2) It looks like after RabbitMQ was restarted at 6:34 it worked as usual, so the problem does not persist.

I can't get more information from the logs, so I will wait until this issue is reproduced and we have access to the cluster.
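For reference, a quick way to double-check points like 1) and 3) above on the attached logs (the exact wording of the alarm lines is an assumption; the point is that resource alarms leave explicit traces in the broker log):

# any file-descriptor or memory resource alarm would appear as an explicit "alarm" line
grep -i alarm /var/log/rabbitmq/*.log
# connections dropped for missed heartbeats, to line up with the 06:23 window
grep -i heartbeat_timeout /var/log/rabbitmq/*.log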

Revision history for this message
Michael Semenov (msemenov) wrote :

We need to reproduce this bug once again and check the env.

tags: added: scale
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The bug has been hanging in the Incomplete state for more than a month, hence moving it to Invalid.

Revision history for this message
krogon-intel (krogon-intel) wrote :

Hello, it occurred in our environment.

ENV:
3 controllers in HA, CentOS as the OS
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  build_number: "18"
  build_id: "2015-06-29_20-18-55"

# rpm -qa | grep rabbit
fuel-rabbit-fence-6.1.0-6755.1.noarch
rabbitmq-server-3.3.5-1.mira1.noarch

What happened (I agree that network disruption may be the trigger):
1. One of our controller nodes got a kernel panic under some load (our fault).
2. The rest of the RabbitMQ servers terminated. Logs are in the attachment.
3. nova-compute hangs on the RabbitMQ socket (even after RabbitMQ comes back); nova service-list reports the services as down.

Possible reason: ** {unexpected_info, {mnesia_locker, rabbit@rabbitmqNode0, granted}}
The error message points to this bug: http://comments.gmane.org/gmane.comp.networking.rabbitmq.general/41377
Possible solution: upgrade to 3.4.x or 3.5.1

Revision history for this message
krogon-intel (krogon-intel) wrote :

This happened in 6.1; details are in my previous comment.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

krogon-intel, was the RMQ cluster reassembled by corosync/pacemaker?
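A one-shot pacemaker status check from any controller would answer this (a sketch; the exact resource name, typically p_rabbitmq-server in Fuel deployments, is an assumption):

# show the current state of the rabbitmq resource as pacemaker sees it
crm_mon -1 | grep -A3 -i rabbit
# equivalent via pcs, where available
pcs status | grep -A3 -i rabbit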

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

krogon-intel, this has been marked as Incomplete for more than a month now without any feedback. I'm closing the bug as Invalid.

Feel free to reopen it, if it's still reproduced and you can provide more data.

Revision history for this message
krogon-intel (krogon-intel) wrote :

Vitaly, RMQ had been powered back on by pacemaker. However, the nova-compute services got stuck in some deadlock. I will try to reproduce the nova-compute hangs and report a separate bug for that.

Regarding the RMQ termination, the solution is to upgrade RMQ to 3.4.x or 3.5.1.
