Two 'primary' controllers created on HA environment when additional controllers added to a cluster

Bug #1364040 reported by Egor Kotko
This bug affects 1 person
Affects: Fuel for OpenStack
Status: New
Importance: Medium
Assigned to: Fuel Library (Deprecated)

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1"
  api: "1.0"
  build_number: "490"
  build_id: "2014-08-31_00-01-17"
  astute_sha: "bc60b7d027ab244039f48c505ac52ab8eb0a990c"
  fuellib_sha: "2cfa83119ae90b13a5bac6a844bdadfaf5aeb13f"
  ostf_sha: "4dcd99cc4bfa19f52d4b87ed321eb84ff03844da"
  nailgun_sha: "d25ed02948a8be773e2bd87cfe583ef7be866bb2"
  fuelmain_sha: "109812be3425408dd7be192b5debf109cb1edd4c"

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_4/150/testReport/%28root%29/ha_flat_scalability/ha_flat_scalability/

Steps to reproduce:
1. Create cluster Ubuntu, HA, flat
2. Add 1 controller node
3. Deploy the cluster
4. Add 2 controller nodes

Actual result:
Deployment finished with errors in puppet-apply.log on node-4:
http://paste.openstack.org/show/104257/

Tags: system-tests
Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
importance: Medium → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: 6.0 → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is a sporadic corosync failure. It happens occasionally due to corosync bugs and will be fixed in future releases by updating corosync to 2.x or migrating to CMAN.

Changed in fuel:
importance: High → Medium
status: New → Confirmed
milestone: 5.1 → 6.0
Egor Kotko (ykotko)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
milestone: 6.0 → 5.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The paste looks very strange, because we removed all CIB shadows and commits. If reproducers confirm it, we will have to fix the corosync provider.

Changed in fuel:
milestone: 5.1 → 6.0
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Cannot reproduce after 3 tries.

Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

http://paste.openstack.org/show/119390/
http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.ubuntu.thread_4/14/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "20"
  build_id: "2014-10-05_00-00-10"
  astute_sha: "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13"
  fuellib_sha: "46ad455514614ec2600314ac80191e0539ddfc04"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d"
  fuelmain_sha: "ce6a2871734bb40e09a6f61e9d007bb7e324fada"

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Egor Kotko (ykotko) wrote :

Reproduced on:
{"build_id": "2014-10-28_00-01-12", "ostf_sha": "f47fd1d66a7255213ee075d5c11b8f111f922000", "build_number": "53", "auth_required": true, "api": "1.0", "nailgun_sha": "fb18068382d522b735ecf446c0f4166c129269fb", "production": "docker", "fuelmain_sha": "f3ad22d12c26794a05e62d46317fa1e47f7f1138", "astute_sha": "97eea90efe0a1f17b4934919d6e459d270c10372", "feature_groups": ["mirantis", "techpreview"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-10-28_00-01-12", "ostf_sha": "f47fd1d66a7255213ee075d5c11b8f111f922000", "build_number": "53", "api": "1.0", "nailgun_sha": "fb18068382d522b735ecf446c0f4166c129269fb", "production": "docker", "fuelmain_sha": "f3ad22d12c26794a05e62d46317fa1e47f7f1138", "astute_sha": "97eea90efe0a1f17b4934919d6e459d270c10372", "feature_groups": ["mirantis", "techpreview"], "release": "6.0", "fuellib_sha": "b8d244a900b25bed8f597e99b309f9ee4ad8ae56"}}}, "fuellib_sha": "b8d244a900b25bed8f597e99b309f9ee4ad8ae56"}

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
status: Invalid → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Should be fixed by the Corosync 2.x upgrade as part of the pacemaker-improvements blueprint.

Changed in fuel:
status: New → Triaged
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Triaged → Invalid
Revision history for this message
Egor Kotko (ykotko) wrote :

This test failed again with an error in the puppet log:
http://paste.openstack.org/show/134777/

{"build_id": "2014-11-18_22-00-23", "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068", "build_number": "114", "auth_required": true, "api": "1.0", "nailgun_sha": "b0add09c4361fee8fc70637c9a6ef42fbe738abe", "production": "docker", "fuelmain_sha": "e556f0e1b00c30ec5c4b374ca2878c047c8686c2", "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-11-18_22-00-23", "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068", "build_number": "114", "api": "1.0", "nailgun_sha": "b0add09c4361fee8fc70637c9a6ef42fbe738abe", "production": "docker", "fuelmain_sha": "e556f0e1b00c30ec5c4b374ca2878c047c8686c2", "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "5a5275370b33ab3b9a403728a1c7ad173289e4a0"}}}, "fuellib_sha": "5a5275370b33ab3b9a403728a1c7ad173289e4a0"}

Egor Kotko (ykotko)
Changed in fuel:
status: Invalid → New
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

It seems that the corosync logic does not work correctly when the HA cluster has only one controller.
In this job: http://jenkins-product.srt.mirantis.net:8080/view/6.0_swarm/job/6.0_fuelmain.system_test.ubuntu.thread_4/26/ ,
pacemaker.log on node-1 contains the following when node-2 joined (note the messages like "error: clone_color: ....:0 is running on node-1 which isn't allowed"):

============= node-1: pacemaker.log
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: 0xb9d040 Node 56978442 (node-2) born on: 880
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: --- 0.41.17
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: 0xb9d040 Node 56978442 now known as node-2 (was: (null))
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: +++ 0.42.1 5bfc08455ff701ef05a3884260f7a29b
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: Node node-2 now has process list: 00000000000000000000000000111312(1118994)
<29>Nov 19 13:25:30 node-1 crmd[8659]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: --- 0.42.2
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: Node node-2 now has 1 quorum votes (was 0)
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: +++ 0.43.1 5490af35dd7d29f29cc94eb736a86c60
<30>Nov 19 13:25:31 node-1 corosync[8627]: [pcmk ] update_expected_votes info: update_expected_votes: Expected quorum votes 2 -> 3
<30>Nov 19 13:25:31 node-1 corosync[8627]: [pcmk ] send_member_notification info: send_member_notification: Sending membership update 880 to 2 children
<30>Nov 19 13:25:31 node-1 corosync[8627]: [CPG ] downlist_log chosen downlist: sender r(0) ip(10.108.101.4) ; members(old:2 left:0)
<29>Nov 19 13:25:31 node-1 corosync[8627]: [MAIN ] corosync_sync_completed Completed service synchronization, ready to provide service.

...

<29>Nov 19 13:25:34 node-1 pengine[8658]: notice: unpack_config: On loss of CCM Quorum: Ignore
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__public[node-2] = -1000000
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__public[node-3] = -1000000
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: ping_vip__public:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__management[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_heat-engine:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_rabbitmq-server:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_mysql:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_haproxy:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: clone_color: p_haproxy:0 is running on node-1 which isn't all...

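For anyone re-checking such a run, a minimal sketch like the following can filter and count the pengine error lines quoted above. This is not a Fuel tool; the log file name is an assumption and should point at a copy of node-1's pacemaker.log.

#!/usr/bin/env python
# Minimal sketch (not part of Fuel): print the pengine "error:" lines from a
# pacemaker.log copy and count them per error type (clone_color,
# common_apply_stickiness, ...). The log path is an assumption.
import re
import sys
from collections import Counter

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "pacemaker.log"

# e.g. "... pengine[8658]: error: clone_color: p_haproxy:0 is running on ..."
error_re = re.compile(r'pengine\[\d+\]:\s+error:\s+(\w+):')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = error_re.search(line)
        if match:
            counts[match.group(1)] += 1
            print(line.rstrip())

for name, total in counts.most_common():
    print("%s: %d error(s)" % (name, total))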

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

This scalability test consists of several steps.
In the first step, an HA cluster is deployed with only one controller.
When the cluster is ready, in the second step two more controllers are added to the cluster and deployment starts again.

It looks like in the second step 'astute' configures an additional primary controller from those two additional controllers, which possibly causes a resource collision in corosync.

According to the astute.log on the master node, the first 'primary' controller was 'node-2':
======= astute.log
2014-11-19T12:44:09 debug: [417] Process message from worker queue:
...
\"nodes\": [{\"swift_zone\": \"2\", \"uid\": \"2\", \"public_address\": \"10.108.100.3\", \"internal_netmask\": \"255.255.255.0\", \"fqdn\": \"node-2.test.domain.local\", \"role\": \"primary-controller\"

And then the second 'primary' controller 'node-1' was added:
======= astute.log
2014-11-19T13:03:43 debug: [397] Process message from worker queue:
...
\"nodes\": [{\"swift_zone\": \"1\", \"uid\": \"1\", \"public_address\": \"10.108.100.4\", \"internal_netmask\": \"255.255.255.0\", \"fqdn\": \"node-1.test.domain.local\", \"role\": \"primary-controller\"

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Fuel Python Team (fuel-python)
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Exactly the same scalability setup worked for me, so this is definitely a sporadic corosync/pacemaker issue.

In my opinion, a puppet apply with the primary-controller role should be an idempotent operation for any node in the cluster; the important part is that this node should be deployed first.
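Purely as an illustration of that ordering requirement (this is not Astute code; the node-dict shape is a hypothetical simplification), the constraint could be sketched like this:

# Illustrative sketch only: whichever node carries the primary-controller
# role must be placed first in the deployment order, before the other
# controllers. The 'fqdn'/'role' dict layout is an assumption.
def deployment_order(nodes):
    """Return nodes reordered so the primary-controller goes first."""
    primary = [n for n in nodes if n["role"] == "primary-controller"]
    others = [n for n in nodes if n["role"] != "primary-controller"]
    return primary + others

# Example with nodes from this report's second deployment:
nodes = [
    {"fqdn": "node-3.test.domain.local", "role": "controller"},
    {"fqdn": "node-1.test.domain.local", "role": "primary-controller"},
]
assert deployment_order(nodes)[0]["fqdn"] == "node-1.test.domain.local"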

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)
summary: - Deployment ha_flat_scalability finished with errors in puppet log
+ Two 'primary' controllers created on HA environment when additional
+ controllers added to a cluster
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Dima, why does 'astute' provide 'puppet' with "node-2.test.domain.local" as "role": "primary-controller", and then with "node-1.test.domain.local" as "role": "primary-controller"?

This issue is consistently reproduced on CI and on my workstation. Could you provide the diagnostic snapshot and Fuel version so that we can analyze and compare them?

{u'build_id': u'2014-11-18_22-00-23', u'ostf_sha': u'82465a94eed4eff1fc8d8e1f2fb7e9993c22f068', u'build_number': u'114', u'auth_required': True, u'nailgun_sha': u'b0add09c4361fee8fc70637c9a6ef42fbe738abe', u'production': u'docker', u'api': u'1.0', u'fuelmain_sha': u'e556f0e1b00c30ec5c4b374ca2878c047c8686c2', u'astute_sha': u'65eb911c38afc0e23d187772f9a05f703c685896', u'feature_groups': [u'mirantis'], u'release': u'6.0', u'release_versions': {u'2014.2-6.0': {u'VERSION': {u'build_id': u'2014-11-18_22-00-23', u'ostf_sha': u'82465a94eed4eff1fc8d8e1f2fb7e9993c22f068', u'build_number': u'114', u'api': u'1.0', u'nailgun_sha': u'b0add09c4361fee8fc70637c9a6ef42fbe738abe', u'production': u'docker', u'fuelmain_sha': u'e556f0e1b00c30ec5c4b374ca2878c047c8686c2', u'astute_sha': u'65eb911c38afc0e23d187772f9a05f703c685896', u'feature_groups': [u'mirantis'], u'release': u'6.0', u'fuellib_sha': u'5a5275370b33ab3b9a403728a1c7ad173289e4a0'}}}, u'fuellib_sha': u'5a5275370b33ab3b9a403728a1c7ad173289e4a0'}

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Also, as for the chronology:
The first deployed node was node-2 as primary controller (time frame ~12:40 ... 13:00).
The second and third deployed nodes were node-2 as primary controller and node-3 as slave (time frame ~13:05 ... 13:25).

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Update: the second and third deployed nodes were node-1 and node-3.
