Two 'primary' controllers created on HA environment when additional controllers added to a cluster

Bug #1364040 reported by Egor Kotko
This bug affects 1 person
Affects: Fuel for OpenStack
Status: New
Importance: Medium
Assigned to: Fuel Library (Deprecated)

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1"
  api: "1.0"
  build_number: "490"
  build_id: "2014-08-31_00-01-17"
  astute_sha: "bc60b7d027ab244039f48c505ac52ab8eb0a990c"
  fuellib_sha: "2cfa83119ae90b13a5bac6a844bdadfaf5aeb13f"
  ostf_sha: "4dcd99cc4bfa19f52d4b87ed321eb84ff03844da"
  nailgun_sha: "d25ed02948a8be773e2bd87cfe583ef7be866bb2"
  fuelmain_sha: "109812be3425408dd7be192b5debf109cb1edd4c"

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_4/150/testReport/%28root%29/ha_flat_scalability/ha_flat_scalability/

Steps to reproduce:
1. Create cluster Ubuntu, HA, flat
2. Add 1 controller node
3. Deploy the cluster
4. Add 2 controller nodes

Actual result:
Deployment finished with errors in puppet-apply.log on node-4:
http://paste.openstack.org/show/104257/

Tags: system-tests
Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
importance: Medium → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: 6.0 → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is a sporadic corosync failure. It happens occasionally due to corosync bugs and will be fixed in future releases by updating corosync to 2.x or migrating to CMAN.

Changed in fuel:
importance: High → Medium
status: New → Confirmed
milestone: 5.1 → 6.0
Egor Kotko (ykotko)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
milestone: 6.0 → 5.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The paste looks very strange, because we removed all CIB shadows and commits. If reproducers confirm it, we will have to fix the corosync provider.

Changed in fuel:
milestone: 5.1 → 6.0
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Cannot reproduce after 3 tries.

Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

http://paste.openstack.org/show/119390/
http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.ubuntu.thread_4/14/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "20"
  build_id: "2014-10-05_00-00-10"
  astute_sha: "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13"
  fuellib_sha: "46ad455514614ec2600314ac80191e0539ddfc04"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d"
  fuelmain_sha: "ce6a2871734bb40e09a6f61e9d007bb7e324fada"

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Egor Kotko (ykotko) wrote :

Reproduced on:
{"build_id": "2014-10-28_00-01-12", "ostf_sha": "f47fd1d66a7255213ee075d5c11b8f111f922000", "build_number": "53", "auth_required": true, "api": "1.0", "nailgun_sha": "fb18068382d522b735ecf446c0f4166c129269fb", "production": "docker", "fuelmain_sha": "f3ad22d12c26794a05e62d46317fa1e47f7f1138", "astute_sha": "97eea90efe0a1f17b4934919d6e459d270c10372", "feature_groups": ["mirantis", "techpreview"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-10-28_00-01-12", "ostf_sha": "f47fd1d66a7255213ee075d5c11b8f111f922000", "build_number": "53", "api": "1.0", "nailgun_sha": "fb18068382d522b735ecf446c0f4166c129269fb", "production": "docker", "fuelmain_sha": "f3ad22d12c26794a05e62d46317fa1e47f7f1138", "astute_sha": "97eea90efe0a1f17b4934919d6e459d270c10372", "feature_groups": ["mirantis", "techpreview"], "release": "6.0", "fuellib_sha": "b8d244a900b25bed8f597e99b309f9ee4ad8ae56"}}}, "fuellib_sha": "b8d244a900b25bed8f597e99b309f9ee4ad8ae56"}

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
status: Invalid → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Should be fixed by the Corosync 2.x upgrade as part of the pacemaker-improvements blueprint.

Changed in fuel:
status: New → Triaged
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Triaged → Invalid
Revision history for this message
Egor Kotko (ykotko) wrote :

This test failed again with an error in the puppet log:
http://paste.openstack.org/show/134777/

{"build_id": "2014-11-18_22-00-23", "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068", "build_number": "114", "auth_required": true, "api": "1.0", "nailgun_sha": "b0add09c4361fee8fc70637c9a6ef42fbe738abe", "production": "docker", "fuelmain_sha": "e556f0e1b00c30ec5c4b374ca2878c047c8686c2", "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-11-18_22-00-23", "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068", "build_number": "114", "api": "1.0", "nailgun_sha": "b0add09c4361fee8fc70637c9a6ef42fbe738abe", "production": "docker", "fuelmain_sha": "e556f0e1b00c30ec5c4b374ca2878c047c8686c2", "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "5a5275370b33ab3b9a403728a1c7ad173289e4a0"}}}, "fuellib_sha": "5a5275370b33ab3b9a403728a1c7ad173289e4a0"}

Egor Kotko (ykotko)
Changed in fuel:
status: Invalid → New
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

It seems that the corosync logic does not work correctly when the HA cluster has only one controller.
In this job: http://jenkins-product.srt.mirantis.net:8080/view/6.0_swarm/job/6.0_fuelmain.system_test.ubuntu.thread_4/26/ ,
pacemaker.log on node-1 contains the following when node-2 joined (note the messages like "error: clone_color: ....:0 is running on node-1 which isn't allowed"):

============= node-1: pacemaker.log
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: 0xb9d040 Node 56978442 (node-2) born on: 880
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: --- 0.41.17
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: 0xb9d040 Node 56978442 now known as node-2 (was: (null))
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: +++ 0.42.1 5bfc08455ff701ef05a3884260f7a29b
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: Node node-2 now has process list: 00000000000000000000000000111312(1118994)
<29>Nov 19 13:25:30 node-1 crmd[8659]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: --- 0.42.2
<30>Nov 19 13:25:30 node-1 corosync[8627]: [pcmk ] update_member info: update_member: Node node-2 now has 1 quorum votes (was 0)
<29>Nov 19 13:25:30 node-1 cib[8654]: notice: cib:diff: Diff: +++ 0.43.1 5490af35dd7d29f29cc94eb736a86c60
<30>Nov 19 13:25:31 node-1 corosync[8627]: [pcmk ] update_expected_votes info: update_expected_votes: Expected quorum votes 2 -> 3
<30>Nov 19 13:25:31 node-1 corosync[8627]: [pcmk ] send_member_notification info: send_member_notification: Sending membership update 880 to 2 children
<30>Nov 19 13:25:31 node-1 corosync[8627]: [CPG ] downlist_log chosen downlist: sender r(0) ip(10.108.101.4) ; members(old:2 left:0)
<29>Nov 19 13:25:31 node-1 corosync[8627]: [MAIN ] corosync_sync_completed Completed service synchronization, ready to provide service.

...

<29>Nov 19 13:25:34 node-1 pengine[8658]: notice: unpack_config: On loss of CCM Quorum: Ignore
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__public[node-2] = -1000000
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__public[node-3] = -1000000
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: ping_vip__public:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: vip__management[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_heat-engine:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_rabbitmq-server:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_mysql:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: common_apply_stickiness: p_haproxy:0[node-2] = 100
<27>Nov 19 13:25:34 node-1 pengine[8658]: error: clone_color: p_haproxy:0 is running on node-1 which isn't all...

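For anyone re-checking such a run, a minimal sketch like the following can filter and count the pengine error lines quoted above. This is not a Fuel tool; the log file name is an assumption and should point at a copy of node-1's pacemaker.log.

#!/usr/bin/env python
# Minimal sketch (not part of Fuel): print the pengine "error:" lines from a
# pacemaker.log copy and count them per error type (clone_color,
# common_apply_stickiness, ...). The log path is an assumption.
import re
import sys
from collections import Counter

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "pacemaker.log"

# e.g. "... pengine[8658]: error: clone_color: p_haproxy:0 is running on ..."
error_re = re.compile(r'pengine\[\d+\]:\s+error:\s+(\w+):')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = error_re.search(line)
        if match:
            counts[match.group(1)] += 1
            print(line.rstrip())

for name, total in counts.most_common():
    print("%s: %d error(s)" % (name, total))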

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

This scalability test consists of several steps.
In the first step, an HA cluster is deployed with only one controller.
When the cluster is ready, in the second step two more controllers are added to the cluster and deployment starts again.

It looks like in the second step 'astute' configures an additional primary controller from those two additional controllers, which possibly causes a resource collision in corosync.

According to the astute.log on the master node, the first 'primary' controller was 'node-2':
======= astute.log
2014-11-19T12:44:09 debug: [417] Process message from worker queue:
...
\"nodes\": [{\"swift_zone\": \"2\", \"uid\": \"2\", \"public_address\": \"10.108.100.3\", \"internal_netmask\": \"255.255.255.0\", \"fqdn\": \"node-2.test.domain.local\", \"role\": \"primary-controller\"

And then the second 'primary' controller 'node-1' was added:
======= astute.log
2014-11-19T13:03:43 debug: [397] Process message from worker queue:
...
\"nodes\": [{\"swift_zone\": \"1\", \"uid\": \"1\", \"public_address\": \"10.108.100.4\", \"internal_netmask\": \"255.255.255.0\", \"fqdn\": \"node-1.test.domain.local\", \"role\": \"primary-controller\"

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Fuel Python Team (fuel-python)
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Exactly the same scalability setup worked for me, so this is definitely a sporadic corosync/pacemaker issue.

In my opinion, a puppet apply with the primary-controller role should be an idempotent operation for any node in the cluster; the important part is that this node should be deployed first.
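Purely as an illustration of that ordering requirement (this is not Astute code; the node-dict shape is a hypothetical simplification), the constraint could be sketched like this:

# Illustrative sketch only: whichever node carries the primary-controller
# role must be placed first in the deployment order, before the other
# controllers. The 'fqdn'/'role' dict layout is an assumption.
def deployment_order(nodes):
    """Return nodes reordered so the primary-controller goes first."""
    primary = [n for n in nodes if n["role"] == "primary-controller"]
    others = [n for n in nodes if n["role"] != "primary-controller"]
    return primary + others

# Example with nodes from this report's second deployment:
nodes = [
    {"fqdn": "node-3.test.domain.local", "role": "controller"},
    {"fqdn": "node-1.test.domain.local", "role": "primary-controller"},
]
assert deployment_order(nodes)[0]["fqdn"] == "node-1.test.domain.local"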

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)
summary: - Deployment ha_flat_scalability finished with errors in puppet log
+ Two 'primary' controllers created on HA environment when additional
+ controllers added to a cluster
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Dima, why does 'astute' provide 'puppet' with "node-2.test.domain.local" as "role": "primary-controller", and then with "node-1.test.domain.local" as "role": "primary-controller"?

This issue is consistently reproduced on CI and on my workstation. Could you provide the diagnostic snapshot and Fuel version so that we can analyze and compare them?

{u'build_id': u'2014-11-18_22-00-23', u'ostf_sha': u'82465a94eed4eff1fc8d8e1f2fb7e9993c22f068', u'build_number': u'114', u'auth_required': True, u'nailgun_sha': u'b0add09c4361fee8fc70637c9a6ef42fbe738abe', u'production': u'docker', u'api': u'1.0', u'fuelmain_sha': u'e556f0e1b00c30ec5c4b374ca2878c047c8686c2', u'astute_sha': u'65eb911c38afc0e23d187772f9a05f703c685896', u'feature_groups': [u'mirantis'], u'release': u'6.0', u'release_versions': {u'2014.2-6.0': {u'VERSION': {u'build_id': u'2014-11-18_22-00-23', u'ostf_sha': u'82465a94eed4eff1fc8d8e1f2fb7e9993c22f068', u'build_number': u'114', u'api': u'1.0', u'nailgun_sha': u'b0add09c4361fee8fc70637c9a6ef42fbe738abe', u'production': u'docker', u'fuelmain_sha': u'e556f0e1b00c30ec5c4b374ca2878c047c8686c2', u'astute_sha': u'65eb911c38afc0e23d187772f9a05f703c685896', u'feature_groups': [u'mirantis'], u'release': u'6.0', u'fuellib_sha': u'5a5275370b33ab3b9a403728a1c7ad173289e4a0'}}}, u'fuellib_sha': u'5a5275370b33ab3b9a403728a1c7ad173289e4a0'}

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Also, as for the chronology:
The first deployed node was node-2 as primary controller (time frame ~12:40 ... 13:00).
The second and third deployed nodes were node-2 as primary controller and node-3 as slave (time frame ~13:05 ... 13:25).

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Update: the second and third deployed nodes were node-1 and node-3.
