Rabbit terminated unexpectedly

Bug #1401948 reported by Ilya Shakhat
This bug affects 2 people
Affects                    Status    Importance  Assigned to     Milestone
Mirantis OpenStack         Invalid   High        Ilya Shakhat
Mirantis OpenStack 6.0.x   Invalid   High        Ilya Shakhat
Mirantis OpenStack 6.1.x   Invalid   High        krogon-intel

Bug Description

RabbitMQ terminated under load from Rally tests.

Log keywords: (full logs are attached)
node-98:
2014-12-12T06:23:15.110223+00:00 notice: ** Generic server <0.639.0> terminating
2014-12-12T06:23:15.110223+00:00 notice: ** Last message in was {mnesia_locker,'rabbit@node-35',granted}
2014-12-12T06:23:15.110223+00:00 notice: ** When Server state == undefined
2014-12-12T06:23:15.110223+00:00 notice: ** Reason for termination ==
2014-12-12T06:23:15.110223+00:00 notice: ** {unexpected_info,{mnesia_locker,'rabbit@node-35',granted}}
2014-12-12T06:23:25.613789+00:00 notice: =ERROR REPORT==== 12-Dec-2014::06:23:17 ===
2014-12-12T06:23:25.613789+00:00 notice: AMQP connection <0.15359.217> (running), channel 0 - error:
2014-12-12T06:23:25.613789+00:00 notice: {amqp_error,connection_forced,
2014-12-12T06:23:25.613789+00:00 notice: "broker forced connection closure with reason 'shutdown'",none}

2014-12-12T06:33:36.030346+00:00 notice: =INFO REPORT==== 12-Dec-2014::06:33:30 ===
2014-12-12T06:33:36.030515+00:00 notice: Error description:
2014-12-12T06:33:36.030515+00:00 notice: {error,{inconsistent_cluster,"Node 'rabbit@node-98' thinks it's clustered with node 'rabbit@node-36', but 'rabbit@node-36' disagrees"}}

node-36:
2014-12-12T06:23:13.311708+00:00 notice: ** Generic server <0.638.0> terminating
2014-12-12T06:23:13.311708+00:00 notice: ** Last message in was {mnesia_locker,'rabbit@node-35',granted}
2014-12-12T06:23:13.311708+00:00 notice: ** When Server state == undefined
2014-12-12T06:23:13.311708+00:00 notice: ** Reason for termination ==
2014-12-12T06:23:13.311708+00:00 notice: ** {unexpected_info,{mnesia_locker,'rabbit@node-35',granted}}

haproxy.log
2014-12-12T06:23:25.009115+00:00 alert: Server rabbitmq/node-35 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009236+00:00 alert: Backup Server rabbitmq/node-98 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009367+00:00 alert: Backup Server rabbitmq/node-36 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-12-12T06:23:30.009367+00:00 emerg: proxy rabbitmq has no server available!

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.0"
  api: "1.0"
  build_number: "49"
  build_id: "2014-12-09_22-41-06"
  astute_sha: "16b252d93be6aaa73030b8100cf8c5ca6a970a91"
  fuellib_sha: "2c99931072d951301d395ebd5bf45c8d401301bb"
  ostf_sha: "a9afb68710d809570460c29d6c3293219d3624d4"
  nailgun_sha: "22bd43b89a17843f9199f92d61fc86cb0f8772f1"
  fuelmain_sha: "3aab16667f47dd8384904e27f70f7a87ba15f4ee"

Revision history for this message
Ilya Shakhat (shakhat) wrote :
Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Ilya Pekelny (i159) wrote :

Does this problem reproduce consistently under the same circumstances? Were all the other node services still working well after Rabbit went down?

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Switching to MOS Linux since the bug is about RabbitMQ's well-being.

Changed in mos:
assignee: MOS Oslo (mos-oslo) → MOS Linux (mos-linux)
milestone: none → 6.0
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

Ilya, could you please provide the following information:
1. How many nodes were involved in the test and what was the deployment scenario?
2. Do you see this bug constantly, or only under certain circumstances? If the latter, what are they?

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Probably 3 nodes were involved, as far as I can see from the logs.

Changed in mos:
assignee: MOS Linux (mos-linux) → Ilya Shakhat (shakhat)
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Answers:
1. The issue happened in the scale lab with 3 controllers and 97 computes. Deployment: HA + Ubuntu, without optional services (Ceilometer, Sahara, Murano).
2. The issue happened only once.

Since the issue most probably occurs rarely and randomly, it would be sufficient to get the following info:
1. What do all these errors in the RabbitMQ log mean? Is it possible to tell from them what went wrong?
2. What kind of info do we need to collect to simplify debugging in the future? (A possible checklist is sketched below.)

Take into account that the same may happen in a customer deployment, and the customer would certainly ask for the reasons.
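For instance, a minimal set of commands worth capturing right after such a crash could look like this (a sketch; the log directory assumes the default RabbitMQ packaging layout on the controllers):

# cluster membership and partition state as RabbitMQ sees it
rabbitmqctl cluster_status
# runtime state, including file descriptor and memory usage
rabbitmqctl status
# per-queue backlog, to see whether consumers stopped draining
rabbitmqctl list_queues name messages consumers
# broker and SASL logs from every controller
tar czf /tmp/rmq-logs-$(hostname).tar.gz /var/log/rabbitmq/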

Revision history for this message
Pavel Chekin (pchekin) wrote :

This issue may be related to https://bugs.launchpad.net/mos/+bug/1383120

I suppose the current file descriptor limit for RabbitMQ is 1024. Try decreasing it to reproduce the issue. If this is the case, then the issue can easily be solved by increasing the file descriptor limit.
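A rough sketch of checking and raising the limit (the /etc/default path applies to the Ubuntu packaging and is an assumption; on CentOS the equivalent file would be /etc/sysconfig/rabbitmq-server):

# current limit as the broker sees it: {file_descriptors,[{total_limit,...},...]} in the output
rabbitmqctl status | grep -A4 file_descriptors
# raise the limit for the service and restart the broker
echo 'ulimit -n 65536' >> /etc/default/rabbitmq-server
service rabbitmq-server restart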

tags: added: mos-linux
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Result of the investigation:

1) Generally, there is not enough information in these logs to understand what happened.
2) There is some information in rabbitmq-sasl.log which I don't know how to interpret (and I found no one who does).
3) This bug is not connected with https://bugs.launchpad.net/mos/+bug/1383120 (file descriptor limit alarms): there is no alarm message in the log, as there would be in that case (see the grep sketch at the end of this comment).

It generally looks like the system had temporary network trouble around the crash point:

1) At 6:23 a lot of connections were closed due to heartbeat_timeout, and then both node-98 and node-36 remained silent until 6:34.
2) It looks like after RabbitMQ was restarted at 6:34 it worked as usual, so the problem does not persist.

I can't get more information from the logs, so I will wait until this issue is reproduced and we have access to the cluster.
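For reference, a quick way to double-check points like 1) and 3) above on the attached logs (the exact wording of the alarm lines is an assumption; the point is that resource alarms leave explicit traces in the broker log):

# any file-descriptor or memory resource alarm would appear as an explicit "alarm" line
grep -i alarm /var/log/rabbitmq/*.log
# connections dropped for missed heartbeats, to line up with the 06:23 window
grep -i heartbeat_timeout /var/log/rabbitmq/*.log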

Revision history for this message
Michael Semenov (msemenov) wrote :

We need to reproduce this bug once again and check the env.

tags: added: scale
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The bug has been hanging in the Incomplete state for more than a month, hence moving it to Invalid.

Revision history for this message
krogon-intel (krogon-intel) wrote :

Hello, it occurred in our environment.

ENV:
3 controllers in HA, CentOS as the OS
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  build_number: "18"
  build_id: "2015-06-29_20-18-55"

# rpm -qa | grep rabbit
fuel-rabbit-fence-6.1.0-6755.1.noarch
rabbitmq-server-3.3.5-1.mira1.noarch

What happened (I agree that network disruption may be the trigger):
1. One of our controller nodes got a kernel panic under some load (our fault).
2. The rest of the RabbitMQ servers terminated. Logs are in the attachment.
3. nova-compute hangs on the RabbitMQ socket (even after RabbitMQ comes back); nova service-list reports the services as down.

Possible reason: ** {unexpected_info, {mnesia_locker, rabbit@rabbitmqNode0, granted}}
The error message points to this bug: http://comments.gmane.org/gmane.comp.networking.rabbitmq.general/41377
Possible solution: upgrade to 3.4.x or 3.5.1

Revision history for this message
krogon-intel (krogon-intel) wrote :

This happened in 6.1; details are in my previous comment.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

krogon-intel, was the RMQ cluster reassembled by corosync/pacemaker?
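A one-shot pacemaker status check from any controller would answer this (a sketch; the exact resource name, typically p_rabbitmq-server in Fuel deployments, is an assumption):

# show the current state of the rabbitmq resource as pacemaker sees it
crm_mon -1 | grep -A3 -i rabbit
# equivalent via pcs, where available
pcs status | grep -A3 -i rabbit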

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

krogon-intel, this has been marked as Incomplete for more than a month now without any feedback. I'm closing the bug as Invalid.

Feel free to reopen it, if it's still reproduced and you can provide more data.

Revision history for this message
krogon-intel (krogon-intel) wrote :

Vitaly, RMQ had been powered back on by pacemaker. However, the nova-compute services got stuck in some deadlock. I will try to reproduce the nova-compute hangs and report a separate bug for that.

Regarding the RMQ termination, the solution is to upgrade RMQ to 3.4.x or 3.5.1.
