nodes joining rabbitmq cluster sometimes hang
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| openstack-ansible | Fix Released | Undecided | Darren Birkett | |
| Liberty | Fix Committed | Undecided | Darren Birkett | |
| Mitaka | Fix Committed | Undecided | Darren Birkett | |
| Trunk | Fix Released | Undecided | Darren Birkett | |
Bug Description
Sometimes when the secondary/tertiary nodes join the primary node and try to form a cluster, the join hangs indefinitely and ends up being killed by the gate job timeout.
Since this is not easily reproducible, it may take some time to track down. The OSA gate logs look like this:
2016-04-20 18:39:37.738 | TASK: [{{ rolename | basename }} | Join rabbitmq cluster] *******
2016-04-20 18:39:37.782 | skipping: [container1]
2016-04-20 18:39:39.160 | changed: [container3]
2016-04-20 19:32:52.965 | Build timed out (after 60 minutes). Marking the build as failed.
2016-04-20 19:32:53.020 | Build was aborted
2016-04-20 19:32:53.020 | [SCP] Copying console log.
2016-04-20 19:32:53.496 | [SCP] Trying to create /srv/static/
2016-04-20 19:32:53.544 | [SCP] Trying to create /srv/static/
2016-04-20 19:32:53.590 | Finished: FAILURE
My current working theory is that the joining nodes all attempt to join the first node and form a cluster at exactly the same time. This possibly causes a race where they attempt to sync from each other, but because they have not yet synced from the master, their Mnesia databases are inconsistent and the join fails (it shouldn't hang, but you know, rabbitmq).
We probably need a separate play for the primary node, then a separate play for the joiners set to use `serial` so they join one at a time. Or something else that makes them join one at a time (you can't serialise just one task, apparently).
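A minimal sketch of what that split might look like. The group and role names here are hypothetical illustrations, not the actual openstack-ansible plays:

```yaml
# Hypothetical sketch: split cluster formation into two plays so that
# joiners are serialised. Group/role names are illustrative only.

# Play 1: configure the primary rabbitmq node on its own first.
- name: Configure primary rabbitmq node
  hosts: rabbitmq_all[0]
  roles:
    - rabbitmq_server

# Play 2: remaining nodes join one at a time, so each joiner syncs
# from an already-formed cluster instead of racing the other joiners.
- name: Join remaining nodes to the cluster
  hosts: rabbitmq_all[1:]
  serial: 1
  roles:
    - rabbitmq_server
```

`serial` applies per play, not per task, which is why the joiners need their own play rather than a serialised task inside the existing one.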
Will update with more detail if I confirm this theory.