Add RabbitMQ heartbeat support to Pacemaker scripts

Bug #1394324 reported by Miroslav Anashkin
This bug affects 2 people
Affects             Status     Importance  Assigned to                Milestone
Fuel for OpenStack  Invalid    Medium      Fuel Library (Deprecated)
5.1.x               Won't Fix  Medium      MOS Oslo
6.0.x               Won't Fix  Medium      MOS Oslo
6.1.x               Won't Fix  Medium      MOS Oslo
7.0.x               Invalid    Medium      Fuel Library (Deprecated)
8.0.x               Invalid    Medium      Fuel Library (Deprecated)

Bug Description

Sometimes the RabbitMQ cluster hangs in the following way:
all its nodes are up, all the PIDs are in place, and rabbitmqctl reports that everything is OK.

In reality, RabbitMQ only generates CPU load and does not process messages.
OpenStack services stall. Restarting the whole RabbitMQ cluster leads to some message loss.

We implemented heartbeat support for RabbitMQ in OpenStack services.

Let us use the heartbeat to determine whether a RabbitMQ node actually processes messages.
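
To make the idea concrete, here is a minimal sketch of such a check: publish a uniquely tagged probe message and see whether it comes back before a timeout. It is only an illustration under stated assumptions. The names broker_processes_messages and InMemoryBroker are hypothetical, the publish and fetch callables stand in for whatever AMQP client would really be used, and the in-memory broker exists only so that the sketch runs without a RabbitMQ node.

```python
import queue
import time
import uuid
from typing import Callable, Optional


def broker_processes_messages(
    publish: Callable[[str], None],
    fetch: Callable[[], Optional[str]],
    timeout: float = 5.0,
    poll_interval: float = 0.05,
) -> bool:
    """Return True if a uniquely tagged probe message comes back in time.

    `publish` and `fetch` are hypothetical adapters around a real AMQP
    client; `fetch` must return None when no message is available.
    """
    token = uuid.uuid4().hex
    publish(token)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        message = fetch()
        if message == token:
            return True
        if message is None:
            # Nothing available yet: wait a little before polling again.
            time.sleep(poll_interval)
        # Any other message (for example a stale probe) is simply discarded.
    return False


class InMemoryBroker:
    """Stand-in for a broker, used only to exercise the sketch."""

    def __init__(self, healthy: bool = True) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue()
        self.healthy = healthy

    def publish(self, body: str) -> None:
        # A "stalled" broker accepts the call but never delivers the message.
        if self.healthy:
            self._queue.put(body)

    def fetch(self) -> Optional[str]:
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


if __name__ == "__main__":
    good = InMemoryBroker(healthy=True)
    stalled = InMemoryBroker(healthy=False)
    print(broker_processes_messages(good.publish, good.fetch, timeout=1.0))
    print(broker_processes_messages(stalled.publish, stalled.fetch, timeout=0.3))
```

A probe of this kind only shows that one message made a round trip through one queue. It says nothing about the RPC flows of the OpenStack services themselves.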

Changed in fuel:
importance: Undecided → High
milestone: none → 6.0
tags: added: customer-found
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: 6.0 → 6.1
importance: High → Medium
tags: added: release-notes
Bogdan Dobrelya (bogdando) wrote :

The description looks too generic. Please provide logs, or at least some additional details, for example: can this 'idle' rabbit list its users and channels? This could also be a duplicate of https://bugs.launchpad.net/fuel/+bug/1396964

Changed in fuel:
status: New → Incomplete
Miroslav Anashkin (manashkin) wrote :

No, this bug is filed to prevent issues similar to https://bugs.launchpad.net/fuel/+bug/1373569, but a bit different.
It may also be related to this private bug:
https://bugs.launchpad.net/fuel/+bug/1374380

Where I last encountered this issue:

Environment: production, about 100 permanent instances, with new instances started only rarely.
Highly loaded network and storage: gigabytes per second of storage traffic, 20-40 Gbit/s total network throughput.

Fuel 5.0.1, CentOS, HA, no OCF scripts for RabbitMQ (rabbitmq is not under Pacemaker), but autoheal enabled in rabbit.conf.
Oslo messaging was updated to the latest 1.3.1 from the Fuel 5.1.1 repository about 2 weeks ago, with all OpenStack/Glance/Cinder/Neutron services restarted.

After a long uptime, the RabbitMQ cluster gradually started to lose queues, probably through TTL expiration.
Status, cluster_status, and the rabbit logs all look OK, but Rabbit does not process messages, does not create new queues, and reports 0 messages in all existing queues.

OpenStack services report different errors like
"Lost connection to MySQL server at 'reading initial communication packet'"
or
nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 127.0.0.1:5673
nova-nova.api.openstack ERROR: Caught error: Timed out waiting for a reply to message ID 3d08739297bd4e498c859c31d0e8a2aa

As can be seen, Oslo messaging reports that it connected to the AMQP server, but it cannot get a message.
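
The pattern in these two log lines (a successful connect, then a reply timeout) can be looked for mechanically. The following is a rough sketch, not a diagnostic tool: the function names scan_log and looks_stalled are hypothetical, the regular expressions are derived only from the two lines quoted above, and a reply timeout can have causes other than a stalled broker.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the two log lines quoted in this report.
CONNECTED_RE = re.compile(r"Connected to AMQP server on (\S+)")
TIMEOUT_RE = re.compile(r"Timed out waiting for a reply to message ID (\w+)")


def scan_log(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Collect AMQP endpoints that were connected to and timed-out message IDs."""
    endpoints: List[str] = []
    timed_out_ids: List[str] = []
    for line in lines:
        match = CONNECTED_RE.search(line)
        if match:
            endpoints.append(match.group(1))
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out_ids.append(match.group(1))
    return endpoints, timed_out_ids


def looks_stalled(lines: Iterable[str]) -> bool:
    """Heuristic: the service connected fine, yet replies still timed out."""
    endpoints, timed_out_ids = scan_log(lines)
    return bool(endpoints) and bool(timed_out_ids)


SAMPLE = [
    "nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 127.0.0.1:5673",
    "nova-nova.api.openstack ERROR: Caught error: Timed out waiting for a reply "
    "to message ID 3d08739297bd4e498c859c31d0e8a2aa",
]

if __name__ == "__main__":
    print(scan_log(SAMPLE))
    print(looks_stalled(SAMPLE))
```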

Restarting the RabbitMQ server instances one by one does not help. One needs to stop all RabbitMQ instances in the cluster, start the RabbitMQ server on a single node first, and then start the remaining RabbitMQ instances (this works with the Fuel customized rabbitmq-server init script only), so that RabbitMQ starts with a clean Mnesia.
Alternatively, stop all RabbitMQ instances, delete the Mnesia dir manually on every node, and start a new RabbitMQ cluster, add users, etc.
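
The ordering of the first recovery procedure can be written down as a small sketch. It only builds the sequence of steps and runs nothing. The function name full_restart_plan and the node names are hypothetical, and the actual stop and start commands depend on the init script in use.

```python
from typing import List, Sequence, Tuple


def full_restart_plan(nodes: Sequence[str], first: str) -> List[Tuple[str, str]]:
    """Return (action, node) steps: stop every node, start `first`, then the rest."""
    if first not in nodes:
        raise ValueError(f"{first!r} is not one of the cluster nodes")
    # Stop all instances before anything is started again.
    plan: List[Tuple[str, str]] = [("stop", node) for node in nodes]
    # Start a single node first, then the remaining ones.
    plan.append(("start", first))
    plan.extend(("start", node) for node in nodes if node != first)
    return plan


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for action, node in full_restart_plan(["node-1", "node-2", "node-3"], "node-2"):
        print(action, node)
```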

Attached are the nova-all and cinder-all logs from the cluster where RabbitMQ failed this way.
The Nova log has history for the last month of uptime; the Cinder log covers the last 4 days only, since it is too big.
These services were the first to start reporting errors.
Logs from other services are available if necessary.
The failure happened somewhere between December 14 and 18 and became obvious by the end of December 18.

Changed in fuel:
status: Incomplete → Confirmed
Miroslav Anashkin (manashkin) wrote :

It is possible that the issue with stalled RabbitMQ in HA mode is fixed in RabbitMQ 3.4.0:
http://www.rabbitmq.com/release-notes/README-3.4.0.txt

But a check of whether the RabbitMQ cluster processes messages is still necessary.

Bogdan Dobrelya (bogdando) wrote :

Miroslav, I agree that checks that process a test message would be really nice to have. However, the fix steps with Mnesia cleaning that you described above mostly serve to clean out all of the 'broken' queues that Oslo.messaging uses for RPC flows. The real root cause of the 'idle processing' shows up as messages like "Caught error: Timed out waiting for a reply to message ID", and these still point only to broken failover in RPC flows. Cleaning the cluster state is overkill, but it should work, of course.
The proper fix would be to make sure that RPC flows self-heal.

Changed in fuel:
milestone: 6.1 → 5.0.3
status: Confirmed → Won't Fix
assignee: Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)
Bogdan Dobrelya (bogdando) wrote :

I assigned this bug to the MOS Oslo team, as it makes sense to first try to implement a new RabbitMQ parameter for HA failover in Oslo.messaging; see https://bugs.launchpad.net/nova/+bug/856764/comments/70. The next step will indeed be to test it with the new RabbitMQ 3.4.0.

summary: - Add RAbbitMQ heartbeat support to Pacemaker scripts
+ Add RabbitMQ heartbeat support to Pacemaker scripts
Viktor Serhieiev (vsergeyev) wrote :

I'm not really sure that the Oslo team should edit Pacemaker scripts; these scripts are not related to the Oslo project at all.

Bogdan Dobrelya (bogdando) wrote :

Note that improved HA health checks are implemented for the 6.1 release. A passed OSTF HA health check now ensures that everything is OK with the underlying AMQP layer. But we don't have checks for the app layer, that is, the OpenStack services running Oslo.messaging code.

tags: added: release-notes-done
removed: release-notes
Bogdan Dobrelya (bogdando) wrote :

I'm not sure that the conclusion to make the control plane (the OCF resource agent) perform AMQP checks with heartbeats is a valid point. I'd say it rather is not. There is an app layer and a control plane layer, and the latter must not take over functions of the former.

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 5.0-updates → 8.0
status: Won't Fix → Invalid
tags: added: wontfix-low