Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

This change makes the OCF script ignore a small number of rabbitmqctl
timeouts for 'heavy' operations: list_channels, get_alarms and list_queues.
The number of timeouts tolerated in a row is configured through a new
variable, 'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e.
rabbitmqctl timeouts are not tolerated at all.

Bug #1487517 is fixed by separating the declaration of the local variables
'rc_alarms' and 'rc_queues' from their assignment.
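The lines actually changed for bug #1487517 are not shown here. The following is only a sketch of the general shell behaviour that this kind of fix usually addresses: when 'local' is combined with a command substitution, '$?' reflects the exit status of 'local' itself rather than that of the command. The function names 'failing_cmd', 'combined' and 'separated' are hypothetical and do not come from the OCF script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a rabbitmqctl call that was killed on timeout.
failing_cmd() {
    return 137
}

# Declaration and assignment in one statement: the status seen afterwards
# is that of the 'local' builtin, so the failure is masked.
combined() {
    local out=$(failing_cmd)
    local rc=$?
    echo "$rc"
}

# Declaration first, assignment second: the status of the command
# substitution is preserved and can be captured.
separated() {
    local out rc=0
    out=$(failing_cmd) || rc=$?
    echo "$rc"
}
```

Calling 'combined' prints 0 although the command failed, while 'separated' prints 137.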
Text for Operations Guide:

If other processes on a node where RabbitMQ is deployed consume a
significant share of CPU, RabbitMQ starts responding slowly to queries
from the 'rabbitmqctl' utility. The RabbitMQ OCF script uses this utility
to monitor the state of RabbitMQ. When the utility fails to return within
a pre-defined timeout, the OCF script considers RabbitMQ to be down and
restarts it, which might lead to a limited (several minutes) OpenStack
downtime. Such restarts are undesirable because they cause downtime
without benefit. To mitigate the issue, the OCF script can be told to
tolerate a certain number of rabbitmqctl timeouts in a row using the
following command:
    crm_resource --resource p_rabbitmq-server --set-parameter \
        max_rabbitmqctl_timeouts --parameter-value N
Here N should be replaced with the number of timeouts. For instance,
if it is set to 3, the OCF script will tolerate two rabbitmqctl
timeouts in a row, but fail if a third one occurs.

By default the parameter is set to 1, i.e. a rabbitmqctl timeout is not
tolerated at all. The downside of increasing the parameter is that
if a real issue causes rabbitmqctl to time out, the OCF script
will detect it only after N monitor runs, so the restart, which
might fix the issue, will be delayed.
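The "N in a row" semantics can be sketched as a small counter. This is an illustration under stated assumptions, not the code from the OCF script: the names 'MAX_TIMEOUTS', 'timeouts_in_a_row' and 'record_result' are hypothetical, exit status 137 is treated as the timeout marker, and the counter lives in a shell variable, whereas a real resource agent would have to keep it somewhere that survives between monitor runs.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the max_rabbitmqctl_timeouts parameter.
MAX_TIMEOUTS=3
# Number of consecutive timeouts seen so far (assumed storage: a variable).
timeouts_in_a_row=0

# record_result RC
# Returns 0 if the monitor should carry on, 1 if the timeout limit is reached.
record_result() {
    local rc=$1
    if [ "$rc" -ne 137 ]; then
        # Not a timeout: the streak is broken, so start counting again.
        # Other failures are outside the scope of this sketch.
        timeouts_in_a_row=0
        return 0
    fi
    timeouts_in_a_row=$((timeouts_in_a_row + 1))
    if [ "$timeouts_in_a_row" -lt "$MAX_TIMEOUTS" ]; then
        echo "rabbitmqctl timed out $timeouts_in_a_row of max. $MAX_TIMEOUTS time(s) in a row. Doing nothing for now." >&2
        return 0
    fi
    # Limit reached: report failure so that the resource gets restarted.
    return 1
}
```

With 'MAX_TIMEOUTS=3', two consecutive calls of 'record_result 137' return 0 and the third returns 1. With the default of 1, the very first timeout already returns 1.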
To determine whether a RabbitMQ restart was caused by a rabbitmqctl
timeout, examine lrmd.log of the corresponding controller in the
/var/log/docker-logs/remote/ directory on the Fuel master node. There,
lines like

    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

indicate a rabbitmqctl timeout. The next line explains whether it
caused a restart or not. For example:

    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."
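The log check described above could be scripted roughly as follows. This is a sketch only: the helper name 'show_rabbitmqctl_timeouts' is hypothetical, and the exact location of lrmd.log below /var/log/docker-logs/remote/ is not given here, so the path is passed in as an argument.

```shell
#!/usr/bin/env bash
# show_rabbitmqctl_timeouts LOGFILE
# Prints every line reporting that rabbitmqctl exited with status 137,
# together with the line that follows it (which says whether the timeout
# was tolerated or led to a restart).
show_rabbitmqctl_timeouts() {
    local logfile=$1
    grep -A 1 -E 'the invoked command exited 137: .*rabbitmqctl' "$logfile"
}
```

It would be called with the path of the controller's lrmd.log as its only argument.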
Reviewed: https://review.openstack.org/222614
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Submitter: Jenkins
Branch: stable/6.1
commit a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Aug 25 17:38:44 2015 +0300
DocImpact: user-guide, operations-guide
Closes-Bug: #1479815
Closes-Bug: #1487517
Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
(cherry picked from commit 2707a5ebbff7012a94de77b60fd594f5bcb29e05)