Bug #1383247 “[system-tests]Fix fuelweb_tests for RabbitMQ HA fu...” : Bugs : Fuel for OpenStack

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2014-10-20:

#1

fail_error_ceph_ha_restart-2014_10_20__10_05_49.tar.gz (12.6 MiB, application/x-tar)

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-23:

#2

Reproduced on CI test: http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_3/29/console

In fact, this issue is caused by RabbitMQ taking a long time to start:

=============== node-1.test.domain.local/cinder-volume.log =======================
2014-10-22T18:19:02.878091+01:00 err: 2014-10-22 17:19:02.850 3403 ERROR oslo.messaging._drivers.impl_rabbit [req-e2d7f6ca-5b6a-4838-a4c7-43370fa25bde - -- - -] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 30 seconds.
======================================

Unfortunately, there are no pacemaker logs in the diagnostic snapshot, so it is hard to investigate what happened.

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-23:

#3

Same issue with RabbitMQ taking a long time to start on CI test http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_5/28/console , test name 'deploy_ha_neutron'.

RabbitMQ started several minutes after OSTF ran, so the test failed on the RabbitMQ OSTF check.

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-23:

#4

RabbitMQ is assembled into a cluster by pacemaker in several separate stages: 'start' to check Mnesia database consistency, then 'pre-promote', 'promote' and 'post-promote' to choose the master and join the other nodes to it.

Pacemaker runs each stage for the 'rabbitmq' resource together with other resources ('heat' and 'mysql'), and moves to the next stage only when all resources have been processed in the current stage, one by one.

We often face a broken Galera cluster that takes a long time to restore.

The script /usr/lib/ocf/resource.d/mirantis/mysql-wss on the controller takes about 7 minutes for every attempt to start Galera, and pacemaker cannot process other resources during that time. This leads to a gap of about seven minutes between the 'rabbitmq' stages.

Taking other resources into account, we have to wait about 20 minutes before RabbitMQ is functional ('start' ... 10 minutes wait for others ... 'promote to master' ... 10 minutes wait for others ... 'join to cluster and allow access to rabbitmq').

Unfortunately, logging from 'mysql-wss' is broken, so here is the output of the mysql-wss script started manually:

================================================================================
[root@node-2 mirantis]# date
Thu Oct 23 18:24:23 UTC 2014

[root@node-2 mirantis]# OCF_ROOT=/usr/lib/ocf/ /usr/lib/ocf/resource.d/mirantis/mysql-wss start
INFO:  mysql_status: ====================== i = 1 ; sleeptime = 5
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 0 retries left
INFO: MySQL is not running
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Checking if galera primary controller
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24134
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24114
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24209
INFO: Possible masters:  node-4.test.domain.local
INFO: Choosed master: node-4.test.domain.local
date
INFO: Waiting for master. 300 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 270 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 240 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 210 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 180 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 150 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 120 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 90 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 60 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 30 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24591
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24809
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24905
INFO: Possible masters:  node-4.test.domain.local
INFO: Choosed master: node-4.test.domain.local
INFO:  mysql_status: ====================== i = 3 ; sleeptime = 5
INFO: MySQL not running: removing old PID file
INFO:  mysql_status: ====================== i = 3 ; sleeptime = 5
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 2 retries left
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 1 retries left
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 0 retries left
INFO: MySQL is not running
INFO:  mysql_status: ====================== i = 3 ; sleeptime = 5
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 2 retries left
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 1 retries left
INFO: MySQL started

[root@node-2 mirantis]# date
Thu Oct 23 18:31:53 UTC 2014
================================================================================
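The two `date` outputs bracket the manual mysql-wss run, so the "about 7 minutes" figure can be checked directly. A minimal Python sketch of that arithmetic follows; the helper name `elapsed_seconds` is illustrative and not part of any Fuel tooling.

```python
from datetime import datetime

# Format of the `date` output shown in the transcript (UTC, C locale).
DATE_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def elapsed_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two `date`-style UTC timestamps."""
    t0 = datetime.strptime(start, DATE_FORMAT)
    t1 = datetime.strptime(end, DATE_FORMAT)
    return int((t1 - t0).total_seconds())


if __name__ == "__main__":
    # Timestamps copied from the transcript: before and after the start attempt.
    secs = elapsed_seconds("Thu Oct 23 18:24:23 UTC 2014",
                           "Thu Oct 23 18:31:53 UTC 2014")
    print(f"{secs} s = {secs // 60} min {secs % 60} s")  # 450 s = 7 min 30 s
```

Of those 450 seconds, 300 are the 'Waiting for master' countdown visible in the log; the rest is spent in the PID-file retries before and after it.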

Dennis Dmitriev (ddmitriev) on 2014-10-24

summary:

- Cinder services are down after cold restart all controllers
+ RabbitMQ is started for a very long time in HA

Dennis Dmitriev (ddmitriev) on 2014-10-27

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)

Revision history for this message

Tatyanka (tatyana-leontovich) wrote on 2014-10-28: Re: RabbitMQ is started for a very long time in HA

#5

reproduced
http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.ha_neutron_destructive/15/testReport/junit/%28root%29/ha_disconnect_controllers/ha_disconnect_controllers/

Changed in fuel:
status:	New → Confirmed

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#6

I believe the proper check for
5. Check cinder services
could be
5. Check fuel health --env X --check HA

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#7

> Unfortunately, there are no pacemaker logs in the diagnostic snapshot, so it is hard to investigate what happened.

Check /var/log/remote/node*/rabbitmq-server.log for cluster reassembling events from pacemaker.
Other logs from corosync and pacemaker are located under /var/log/node*/crmd , lrmd, attrd, cibadmin etc.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#8

For the logs attached in #1, you can inspect ./node-{1,2,4}.test.domain.local/lrmd.log for rabbitmq reassembling events.
When reassembly is done, there should be messages in the logs like 'INFO: p_rabbitmq-server: get_monitor(): rabbit app is running and is member of healthy cluster'

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#9

Looks like the test case described in the bug was not performed correctly: http://pastebin.com/jZ5UM6qb (from the logs in #1)

As you can see, the full reboot AND cluster reassembly verification period was less than 5 minutes, and the logs snapshot was taken too early, before the cluster managed to reassemble.

The correct check should:
1) measure time-to-reassemble from the moment when the reboot has finished, rather than from when it was initiated.
2) measure time-to-reassemble for any given node between the moment when corosync started and the timestamp of the nearest 'rabbit app is running and is member of healthy cluster' event.
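The second measurement could be sketched in Python roughly as below. This is only an illustration under assumptions: it expects each log line to begin with an ISO 8601 timestamp (as in the cinder-volume.log excerpt in #2), and the text that marks a corosync start is passed in by the caller because the thread does not say what that line looks like. The function names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional

# Message quoted in this thread as the sign of a reassembled cluster.
HEALTHY_MARKER = "rabbit app is running and is member of healthy cluster"


def line_time(line: str) -> Optional[datetime]:
    """Parse the leading ISO 8601 timestamp of a log line, if there is one."""
    parts = line.split(maxsplit=1)
    if not parts:
        return None
    try:
        return datetime.fromisoformat(parts[0])
    except ValueError:
        return None


def time_to_reassemble(lines: Iterable[str], start_marker: str) -> Optional[float]:
    """Seconds from the first line containing start_marker (assumed to mean
    'corosync started') to the nearest later healthy-cluster line.
    Returns None if either event is missing."""
    started = None
    for line in lines:
        ts = line_time(line)
        if ts is None:
            continue  # skip lines without a parsable timestamp
        if started is None:
            if start_marker in line:
                started = ts
        elif HEALTHY_MARKER in line:
            return (ts - started).total_seconds()
    return None
```

In practice the corosync and rabbitmq events live in different files, so the lines of both files for one node would have to be merged in timestamp order before being passed in.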

tags:

added: to-be-covered-by-tests

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#10

Please update the verification steps and re-submit the correctly taken logs snapshot

Changed in fuel:
status:	Confirmed → Incomplete
importance:	Medium → High

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-29:

#11

We will perform checks for critical services in the following order:

- Wait until MySQL Galera is up on some controller
- Wait until the RabbitMQ cluster is up and accepts connections
- Wait until Cinder services are up on some controller
- Check Ceph status
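The ordering above amounts to polling each check until it passes before moving to the next one. A minimal Python sketch, assuming each service check is available as a callable that returns True when the service is ready; `wait_until`, `wait_in_order` and the check callables are hypothetical names, not the fuelweb_tests API:

```python
import time
from typing import Callable, Sequence, Tuple


def wait_until(check: Callable[[], bool], timeout: float,
               interval: float = 5.0) -> bool:
    """Poll check() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_in_order(checks: Sequence[Tuple[str, Callable[[], bool]]],
                  timeout_each: float, interval: float = 5.0) -> None:
    """Wait for each named check in turn; stop at the first one that times out."""
    for name, check in checks:
        if not wait_until(check, timeout_each, interval):
            raise TimeoutError(f"{name} not ready after {timeout_each} s")
```

A caller would pass something like `[("galera", galera_is_up), ("rabbitmq", rabbit_accepts_connections), ("cinder", cinder_is_up), ("ceph", ceph_is_healthy)]`, where every callable is a placeholder for a real check. Given the timings reported in #4, the per-check timeout would need to be well above the 5 minutes used by the original test.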

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-29:

#12

https://review.openstack.org/#/c/131742/

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-29:

#13

Looks good, but please consider replacing
- Wait until MySQL Galera is up on some controller
- Wait until the RabbitMQ cluster is up and accepts connections
with the OSTF HA test group

Revision history for this message

Dennis Dmitriev (ddmitriev) wrote on 2014-10-29:

#14

As far as I can see, the RabbitMQ OSTF test performs only a "rabbitmqctl cluster_status" check. OSTF checks no functionality, so it never covers the pacemaker logic for assembling the rabbitmq cluster.

There are some situations where OSTF doesn't reflect the actual rabbitmq status:
- "rabbitmqctl cluster_status" shows that all nodes are running, but the pacemaker OCF script hasn't opened port 5673 in iptables yet (or an extra iptables rule that blocks port 5673 remains);
- "rabbitmqctl cluster_status" shows that all nodes are running, but pacemaker is only checking whether rabbitmq starts (start phase) and is going to shut rabbitmq down before performing further steps;
- "rabbitmqctl cluster_status" shows that all nodes are running, but it is still inaccessible through haproxy because of haproxy, vip__management, network or any other issue. In this case rabbitmq appears non-functional to other services.

We want to make sure that rabbitmq is successfully assembled by pacemaker and ready to serve requests from other services.
A better way would be to create a queue and send some test messages, but this is not implemented in OSTF yet.
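A lighter check than creating a queue, but one that still catches the blocked-port and haproxy cases listed above, is to open a TCP connection to the AMQP endpoint and see whether anything answers the AMQP 0-9-1 protocol header. The Python sketch below uses only the standard library; the function name is hypothetical, and it assumes the endpoint speaks AMQP 0-9-1. It does not authenticate, declare a queue or publish a message, so it is weaker than the functional test described above.

```python
import socket

# Protocol header a client sends first on an AMQP 0-9-1 connection.
AMQP_0_9_1_HEADER = b"AMQP\x00\x00\x09\x01"


def amqp_endpoint_answers(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if host:port accepts TCP and replies to the AMQP header."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(AMQP_0_9_1_HEADER)
            reply = sock.recv(8)
    except OSError:
        # Refused, filtered (timeout), reset or unreachable.
        return False
    # A broker replies with a method frame (type octet 1, Connection.Start)
    # or, on a version mismatch, with its own header starting with b"AMQP".
    # An empty reply means the peer closed the connection without answering.
    return reply[:1] == b"\x01" or reply[:4] == b"AMQP"
```

For the environment in this bug the call would be along the lines of `amqp_endpoint_answers("127.0.0.1", 5673)`, using the address from the cinder-volume.log excerpt in #2.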

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-30:

#15

Submitted related bug https://bugs.launchpad.net/fuel/+bug/1387567

Ok, please stay in touch with OSTF team so they could reuse your code as well.

Bogdan Dobrelya (bogdando) on 2014-10-30

Changed in fuel:
status:	Incomplete → In Progress
assignee:	Fuel Library Team (fuel-library) → Dennis Dmitriev (ddmitriev)

Bogdan Dobrelya (bogdando) on 2014-11-18

summary:

- RabbitMQ is started for a very long time in HA
+ Fix fuelweb_tests for RabbitMQ HA full cluster restart

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-03: Change abandoned on fuel-main (master)

#16

Change abandoned by Dennis Dmitriev (<email address hidden>) on branch: master
Review: https://review.openstack.org/131742
Reason: The RabbitMQ check in OSTF will be more powerful than this one, so it is not necessary to add custom checks.

Fuel Devops McRobotson (fuel-devops-robot) on 2015-02-22

no longer affects:

fuel/6.0.x

Revision history for this message

Tatyanka (tatyana-leontovich) wrote on 2015-03-04: Re: Fix fuelweb_tests for RabbitMQ HA full cluster restart

#17

Moved to Incomplete for 6.0.x because it is currently not clear how to reproduce the issue.

Nastya Urlapova (aurlapova) on 2015-03-30

summary:

- Fix fuelweb_tests for RabbitMQ HA full cluster restart
+ [system-tests]Fix fuelweb_tests for RabbitMQ HA full cluster restart

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-29: Related fix proposed to fuel-ostf (master)

#18

Related fix proposed to branch: master
Review: https://review.openstack.org/178864

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-30: Fix proposed to fuel-qa (master)

#19

Fix proposed to branch: master
Review: https://review.openstack.org/178966

Bogdan Dobrelya (bogdando) on 2015-05-11

tags:

added: non-release

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-20: Fix merged to fuel-qa (master)

#20

Reviewed: https://review.openstack.org/178966
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=aa50833aaebc598de37fcc5d617d77f894b569e7
Submitter: Jenkins
Branch: master

commit aa50833aaebc598de37fcc5d617d77f894b569e7
Author: Dennis Dmitriev <email address hidden>
Date: Thu May 14 17:02:13 2015 +0300

    Add two methods to wait for cluster HA and OS services ready

    assert_ha_services_ready():
     OSTF 'HA' test group should be used to validate if a cluster
     in the operational state.
     There are rabbitmq and mysql checks, and will be added haproxy
     and pacemaker checks.

     Without these services the cluster can fail requests from tests.

    assert_os_services_ready():
     OSTF 'Sanity' test group to wait until OpenStack services are
     ready.

    Change-Id: Ie1bddc965719ca59a143f8f43c53546a4553b1b9
    Closes-Bug: #1383247
    Closes-Bug: #1455910

Changed in fuel:
status:	In Progress → Fix Committed

Revision history for this message

Oleksiy Molchanov (omolchanov) wrote on 2015-06-08:

#21

Moved to invalid as the issue for that version was not updated for more than 3 weeks.

Revision history for this message

Alexey Stupnikov (astupnikov) wrote on 2017-05-29:

#22

MOS5.1 is no longer supported, moving to Won't Fix.

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	High	Dennis Dmitriev	Fuel for OpenStack 6.1
5.1.x	Won't Fix	High	Fuel QA Team	Fuel for OpenStack 5.1.1-updates
6.0.x	Invalid	Undecided	Fuel QA Team	Fuel for OpenStack 6.0-updates
6.1.x	Fix Released	High	Dennis Dmitriev	Fuel for OpenStack 6.1
