nodes joining rabbitmq cluster sometimes hang
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| openstack-ansible | Fix Released | Undecided | Darren Birkett | |
| Liberty | Fix Committed | Undecided | Darren Birkett | |
| Mitaka | Fix Committed | Undecided | Darren Birkett | |
| Trunk | Fix Released | Undecided | Darren Birkett | |
Bug Description
Sometimes when the secondary/tertiary nodes join the primary node and try to form a cluster, the join hangs indefinitely and ends up being killed by the gate job timeout.
Since this is not easily reproducible, it may take some time to track down. The OSA gate logs look like this:
2016-04-20 18:39:37.738 | TASK: [{{ rolename | basename }} | Join rabbitmq cluster] *******
2016-04-20 18:39:37.782 | skipping: [container1]
2016-04-20 18:39:39.160 | changed: [container3]
2016-04-20 19:32:52.965 | Build timed out (after 60 minutes). Marking the build as failed.
2016-04-20 19:32:53.020 | Build was aborted
2016-04-20 19:32:53.020 | [SCP] Copying console log.
2016-04-20 19:32:53.496 | [SCP] Trying to create /srv/static/
2016-04-20 19:32:53.544 | [SCP] Trying to create /srv/static/
2016-04-20 19:32:53.590 | Finished: FAILURE
My current working theory is that the joining nodes all attempt to join the first node and form a cluster at exactly the same time. This possibly causes a race where they attempt to sync from each other, but because they have not yet synced from the master, their Mnesia databases are inconsistent and the join fails (it shouldn't hang, but you know, rabbitmq).
We probably need a separate play for the primary node, then a separate play for the joiners set to use `serial` so they join one at a time. Or something else that makes them join one at a time (you can't serialise just one task, apparently).
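A minimal sketch of what that split might look like. The group and role names here are hypothetical illustrations, not the actual openstack-ansible plays:

```yaml
# Hypothetical sketch: split cluster formation into two plays so that
# joiners are serialised. Group/role names are illustrative only.

# Play 1: configure the primary rabbitmq node on its own first.
- name: Configure primary rabbitmq node
  hosts: rabbitmq_all[0]
  roles:
    - rabbitmq_server

# Play 2: remaining nodes join one at a time, so each joiner syncs
# from an already-formed cluster instead of racing the other joiners.
- name: Join remaining nodes to the cluster
  hosts: rabbitmq_all[1:]
  serial: 1
  roles:
    - rabbitmq_server
```

`serial` applies per play, not per task, which is why the joiners need their own play rather than a serialised task inside the existing one.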
Will update with more detail if I confirm this theory.