scenario003-multinode-oooq-container too slow to stop container, makes deployment fail
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo |
Incomplete
|
High
|
Unassigned |
Bug Description
From failed run http://
While deploying a HA containerized overcloud, pacemaker determines that it needs to restart a docker container which will be used to run mysql/galera.
When pacemaker schedules a stop operation, the ocf resource agent that manages docker containers spawns a 'docker stop' command Which takes too long (> 20s) to finish and makes the stop action time out for pacemaker.
A failed stop operation is a blocking operation for pacemaker (fencing is not configured at this stage), so the galera container won't restart, and the deployment eventually fails.
A docker stop operators usually never takes more than 20s to operate, so the host might have been under heavy/abnormal load.
Changed in tripleo: | |
importance: | Undecided → Critical |
milestone: | none → stein-1 |
tags: | added: queens-backport-potential rocky-backport-potential |
Changed in tripleo: | |
status: | New → Triaged |
importance: | Critical → High |
Changed in tripleo: | |
milestone: | stein-1 → stein-2 |
Changed in tripleo: | |
milestone: | stein-2 → stein-3 |
Changed in tripleo: | |
milestone: | stein-3 → stein-rc1 |
Changed in tripleo: | |
milestone: | stein-rc1 → train-1 |
Changed in tripleo: | |
milestone: | train-1 → train-2 |
Changed in tripleo: | |
milestone: | train-2 → train-3 |
Changed in tripleo: | |
milestone: | train-3 → ussuri-1 |
Changed in tripleo: | |
milestone: | ussuri-1 → ussuri-2 |
Changed in tripleo: | |
milestone: | ussuri-2 → ussuri-3 |
Changed in tripleo: | |
status: | Triaged → Incomplete |
Changed in tripleo: | |
milestone: | ussuri-3 → ussuri-rc3 |
Changed in tripleo: | |
milestone: | ussuri-rc3 → victoria-1 |
Changed in tripleo: | |
milestone: | victoria-1 → victoria-3 |
Having a look at the corosync logs on the overcloud node [1]
puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera
Oct 23 11:46:55 [19679] centos- 7-inap- mtl01-000332949 5 crmd: info: abort_transitio n_graph: Transition aborted by deletion of meta_attributes [@id='galera- bundle- meta_attributes ']: Configuration change | cib=0.17.0 source= te_update_ diff:456 path=/cib/ configuration/ resources/ bundle[ @id='galera- bundle' ]/meta_ attributes[ @id='galera- bundle- meta_attributes '] complete=true
which let pengine schedule a start action to create the docker container
Oct 23 11:46:55 [19678] centos- 7-inap- mtl01-000332949 5 pengine: info: RecurringOp: Start recurring monitor (60s) for galera- bundle- docker- 0 on centos- 7-inap- mtl01-000332949 5 7-inap- mtl01-000332949 5 crmd: notice: te_rsc_command: Initiating start operation galera- bundle- docker- 0_start_ 0 locally on centos- 7-inap- mtl01-000332949 5 | action 27
Oct 23 11:46:55 [19679] centos-
Oct 23 11:46:55 docker( galera- bundle- docker- 0)[40767] : INFO: checking for nsenter, which is required when 'monitor_cmd' is specified galera- bundle- docker- 0)[40767] : INFO: running container galera- bundle- docker- 0 for the first time galera- bundle- docker- 0)[40767] : INFO: monitor cmd passed: exit code = 0 galera- bundle- docker- 0)[40767] : NOTICE: Container galera- bundle- docker- 0 started successfully 7-inap- mtl01-000332949 5 lrmd: info: log_finished: finished - rsc:galera- bundle- docker- 0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms
Oct 23 11:46:55 docker(
Oct 23 11:46:56 docker(
Oct 23 11:46:56 docker(
Oct 23 11:46:56 [19676] centos-
and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds
Oct 23 11:46:56 [19679] centos- 7-inap- mtl01-000332949 5 crmd: notice: te_rsc_command: Initiating monitor operation galera- bundle- docker- 0_monitor_ 60000 locally on centos- 7-inap- mtl01-000332949 5 | action 28 galera- bundle- docker- 0)[40962] : INFO: monitor cmd passed: exit code = 0 7-inap- mtl01-000332949 5 crmd: info: process_lrm_event: Result of monitor operation for galera- bundle- docker- 0 on centos- 7-inap- mtl01-000332949 5: 0 (ok) | call=18 key=galera- bundle- docker- 0_monitor_ 60000 confirmed=false cib-update=105
Oct 23 11:46:56 docker(
Oct 23 11:46:56 [19679] centos-
then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container
Oct 23 11:47:23 [19674] centos- 7-inap- mtl01-000332949 5 cib: info: cib_process_ request: Forwarding cib_apply_diff operation for section 'all' to all (origin= local/cibadmin/ 2) 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: Diff: --- 0.17.4 2 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: Diff: +++ 0.18.0 (null) 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: + /cib: @epoch=18, @num_updates=0 7-inap- mtl01-000332949 5 cib: info: ci...
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-