Bug #1394324 “Add RabbitMQ heartbeat support to Pacemaker script...” : Bugs : Fuel for OpenStack

Miroslav Anashkin (manashkin) on 2014-11-19

Changed in fuel:
importance:	Undecided → High
milestone:	none → 6.0
tags:	added: customer-found
Changed in fuel:
assignee:	nobody → Fuel Library Team (fuel-library)

Sergey Vasilenko (xenolog) on 2014-11-19

Changed in fuel:
milestone:	6.0 → 6.1
importance:	High → Medium

Fabrizio Soppelsa (fsoppelsa) on 2014-12-02

tags:

added: release-notes

Bogdan Dobrelya (bogdando) wrote on 2014-12-04:

#1

The description looks too generic; please provide logs, or at least some additional details, e.g. can this 'idle' rabbit list its users and channels? This could also be a duplicate of https://bugs.launchpad.net/fuel/+bug/1396964

Changed in fuel:
status:	New → Incomplete

Miroslav Anashkin (manashkin) wrote on 2014-12-19:

#2

logs from failed Rabbit cluster (3.2 MiB, application/octet-stream)

No, this bug is filed to prevent issues similar to https://bugs.launchpad.net/fuel/+bug/1373569, but a bit different.
It may also be related to this private bug:
https://bugs.launchpad.net/fuel/+bug/1374380

Where I encountered this issue last time:

Environment: production, about 100 permanent instances, new instances are started rarely.
Heavily loaded network and storage: gigabytes per second of storage traffic, 20-40 Gbit total network throughput.

Fuel 5.0.1, CentOS, HA, no OCF scripts for RabbitMQ (RabbitMQ is not under Pacemaker), but autoheal is enabled in rabbit.conf.
Oslo messaging was updated to the latest 1.3.1 from the Fuel 5.1.1 repository about 2 weeks ago, with a restart of all OpenStack/Glance/Cinder/Neutron services.

After a long uptime the RabbitMQ cluster gradually started to lose queues, probably through TTL expiration.
status, cluster_status and the rabbit logs all look OK, but Rabbit does not process messages, does not create new queues and reports 0 messages in all existing queues.

OpenStack services report various errors, such as
"Lost connection to MySQL server at 'reading initial communication packet'"
or
nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 127.0.0.1:5673
nova-nova.api.openstack ERROR: Caught error: Timed out waiting for a reply to message ID 3d08739297bd4e498c859c31d0e8a2aa

As can be seen, Oslo messaging reports that it is connected to the AMQP server, but it cannot get messages.
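
The symptom above (a successful connection followed by reply timeouts) can be searched for in service logs. The following is only a minimal sketch: the two patterns are copied from the log lines quoted above, while the function name `find_stalled_replies` and the pairing logic are illustrative assumptions, not part of any OpenStack tool.

```python
import re

# Patterns copied from the log excerpts quoted in this comment.
CONNECTED = re.compile(r"Connected to AMQP server on (\S+)")
TIMED_OUT = re.compile(r"Timed out waiting for a reply to message ID (\w+)")


def find_stalled_replies(lines):
    """Return (server, message_id) pairs for reply timeouts that were
    logged after the client had reported a successful AMQP connection."""
    last_server = None
    hits = []
    for line in lines:
        connected = CONNECTED.search(line)
        if connected:
            # Remember the most recent server the client connected to.
            last_server = connected.group(1)
            continue
        timed_out = TIMED_OUT.search(line)
        if timed_out and last_server is not None:
            hits.append((last_server, timed_out.group(1)))
    return hits


sample = [
    "nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 127.0.0.1:5673",
    "nova-nova.api.openstack ERROR: Caught error: Timed out waiting for a reply to message ID 3d08739297bd4e498c859c31d0e8a2aa",
]
print(find_stalled_replies(sample))
```

A timeout that appears without any earlier "Connected" line is deliberately ignored, since it would point to a plain connectivity problem rather than to the stalled-broker case described here.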

Restarting the RabbitMQ server instances one by one does not help. One needs to stop all RabbitMQ instances in the cluster, start the RabbitMQ server on a single node first, and then start the remaining RabbitMQ instances (this works with the Fuel customized rabbitmq-server init script only), so that RabbitMQ starts with a clean Mnesia.
Alternatively, stop all RabbitMQ instances, delete the Mnesia dir manually on every node, start a new RabbitMQ cluster, add users, etc.
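
The ordering of the first recovery procedure can be sketched as a small plan builder. This is an illustration under assumptions: the node names and the `service rabbitmq-server stop/start` command strings are hypothetical placeholders, since the comment only states that the Fuel customized init script is required and does not give the exact commands.

```python
def full_restart_plan(nodes, service="rabbitmq-server"):
    """Build ordered (node, command) steps for a full cluster restart.

    Every node is stopped before any node is started, and the first
    node in the list is started alone before the remaining ones.
    The command strings are assumptions, not taken from Fuel.
    """
    if not nodes:
        return []
    first, rest = nodes[0], nodes[1:]
    # Phase 1: stop every instance so that no old cluster state survives.
    plan = [(node, f"service {service} stop") for node in nodes]
    # Phase 2: bring up a single node first.
    plan.append((first, f"service {service} start"))
    # Phase 3: start the remaining nodes.
    plan += [(node, f"service {service} start") for node in rest]
    return plan


# Hypothetical node names, for illustration only.
for node, command in full_restart_plan(["node-1", "node-2", "node-3"]):
    print(node, command)
```

The point of the sketch is the contrast with a rolling restart: here no start step appears until all stop steps have been listed.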

Attached please find the nova-all and cinder-all logs from the cluster where RabbitMQ failed this way.
The Nova log has history for the last month of uptime; the Cinder log covers the last 4 days only, since it is too big.
These services were the first to report errors.
Logs from other services are available if necessary.
The failure happened somewhere between December 14 and 18 and became evident by the end of December 18.

Changed in fuel:
status:	Incomplete → Confirmed

Miroslav Anashkin (manashkin) wrote on 2014-12-19:

#3

It is possible that the issue with a stalled RabbitMQ in HA mode is fixed in RabbitMQ 3.4.0:
http://www.rabbitmq.com/release-notes/README-3.4.0.txt

But a check that the RabbitMQ cluster actually processes messages is still necessary.
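
Such a check could take the form of a round-trip probe: publish a uniquely tagged test message and confirm that it is delivered back within a timeout. The sketch below is transport-agnostic and only illustrates the idea; `publish` and `fetch` are hypothetical callables that a real check would back with an AMQP client, and the in-memory stand-ins exist only to make the example runnable.

```python
import queue
import uuid


def broker_processes_messages(publish, fetch, timeout=5.0):
    """Publish a uniquely tagged probe and wait for it to come back.

    `publish(body)` and `fetch(timeout)` are hypothetical callables;
    `fetch` returns a message body, or None if nothing arrives in time.
    """
    probe = uuid.uuid4().hex
    publish(probe)
    return fetch(timeout) == probe


# In-memory stand-in for a healthy broker.
healthy = queue.Queue()


def healthy_fetch(timeout):
    try:
        return healthy.get(timeout=timeout)
    except queue.Empty:
        return None


# Stand-in for the stalled broker described in this bug: it accepts
# the message but never delivers it.
def stalled_publish(body):
    pass


def stalled_fetch(timeout):
    return None


print(broker_processes_messages(healthy.put, healthy_fetch, timeout=0.1))
print(broker_processes_messages(stalled_publish, stalled_fetch, timeout=0.1))
```

Unlike status or cluster_status output, a probe of this kind fails when the broker accepts connections but does not move messages, which is the failure mode reported above.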

Bogdan Dobrelya (bogdando) wrote on 2014-12-19:

#5

Miroslav, I agree that checks with some test message processing would be really nice to have. However, the fix steps with Mnesia cleaning you described above mostly serve to clean up all of the 'broken' queues which Oslo.messaging uses for RPC flows. The real root cause of the 'idle processing' is indicated by messages like "Caught error: Timed out waiting for a reply to message ID". And these still point only to the broken failover in RPC flows. Cleaning the cluster state is overkill, but should work, of course.
The proper fix would be to make sure that RPC flows are self-healing.

Changed in fuel:
milestone:	6.1 → 5.0.3
status:	Confirmed → Won't Fix
assignee:	Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)

Bogdan Dobrelya (bogdando) wrote on 2014-12-19:

#6

I assigned this bug to the MOS Oslo team, as it makes sense first to try to implement a new RabbitMQ parameter for HA failover in Oslo.messaging, see https://bugs.launchpad.net/nova/+bug/856764/comments/70. The next step will indeed be to test it with the new RabbitMQ 3.4.0.

Timur Nurlygayanov (tnurlygayanov) on 2015-04-17

summary:

- Add RAbbitMQ heartbeat support to Pacemaker scripts
+ Add RabbitMQ heartbeat support to Pacemaker scripts

Viktor Serhieiev (vsergeyev) wrote on 2015-05-13:

#7

I'm not really sure that the Oslo team should edit Pacemaker scripts; these scripts are not related to the Oslo project at all.

Bogdan Dobrelya (bogdando) wrote on 2015-05-25:

#8

Note that improved HA health checks are implemented for the 6.1 release. So a passed OSTF HA health check now ensures that everything is OK with the underlying AMQP layer. But we don't have checks for the app layer, i.e. the OpenStack services running Oslo.messaging code.

Maria Zlatkova (mzlatkova) on 2015-05-26

tags:

added: release-notes-done
removed: release-notes

Bogdan Dobrelya (bogdando) wrote on 2015-08-07:

#9

I'm not sure that the conclusion to make the control plane (the OCF resource agent) perform AMQP checks with heartbeats is a valid point. I'd say it is rather not. There are the app layer and the control plane layer, and the latter must not take over functions from the former.

Dmitry Pyzhov (dpyzhov) on 2015-10-22

tags:

added: area-library

Dmitry Pyzhov (dpyzhov) on 2015-11-30

Changed in fuel:
milestone:	5.0-updates → 8.0
status:	Won't Fix → Invalid

Vitaly Sedelnik (vsedelnik) on 2016-03-10

tags:

added: wontfix-low

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	Medium	Fuel Library (Deprecated)	Fuel for OpenStack 8.0
5.1.x	Won't Fix	Medium	MOS Oslo	Fuel for OpenStack 5.1.1-updates
6.0.x	Won't Fix	Medium	MOS Oslo	Fuel for OpenStack 6.0-updates
6.1.x	Won't Fix	Medium	MOS Oslo	Fuel for OpenStack 6.1
7.0.x	Invalid	Medium	Fuel Library (Deprecated)	Fuel for OpenStack 7.0
8.0.x	Invalid	Medium	Fuel Library (Deprecated)	Fuel for OpenStack 8.0
