In testrun https://solutions.qa.canonical.com/v2/testruns/8b382b6f-73e8-455c-be26-d7b8d9b02143, openstack-loadbalancer blocks with:
```
openstack-loadbalancer/1 active idle 1/lxd/9 10.246.167.21 Unit is ready
filebeat/38 active idle 10.246.167.21 Filebeat ready.
hacluster-openstack-loadbalancer/2 active idle 10.246.167.21 Unit is ready and clustered
landscape-client/38 maintenance idle 10.246.167.21 Need computer-title and juju-info to proceed
logrotated/31 active idle 10.246.167.21 Unit is ready.
nrpe/47 active idle 10.246.167.21 icmp,5666/tcp Ready
prometheus-grok-exporter/39 active idle 10.246.167.21 9144/tcp Unit is ready
public-policy-routing/18 active idle 10.246.167.21 Unit is ready
telegraf/38 active idle 10.246.167.21 9103/tcp Monitoring openstack-loadbalancer/1 (source version/commit d208a64)
openstack-loadbalancer/2* blocked idle 2/lxd/10 10.246.165.129 Services not running that should be: haproxy
filebeat/15 active idle 10.246.165.129 Filebeat ready.
hacluster-openstack-loadbalancer/0* active idle 10.246.165.129 Unit is ready and clustered
landscape-client/14 maintenance idle 10.246.165.129 Need computer-title and juju-info to proceed
logrotated/9 active idle 10.246.165.129 Unit is ready.
nrpe/21 active idle 10.246.165.129 icmp,5666/tcp Ready
prometheus-grok-exporter/15 active idle 10.246.165.129 9144/tcp Unit is ready
public-policy-routing/3 active idle 10.246.165.129 Unit is ready
telegraf/14 active idle 10.246.165.129 9103/tcp Monitoring openstack-loadbalancer/2 (source version/commit d208a64)
```
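For anyone retracing this while the environment is still up, something along these lines should show the state of haproxy and the pacemaker resource on the blocked unit (a sketch; it assumes crmsh is present on the unit, which the hacluster charm normally takes care of):
```
# systemd's view of haproxy on the blocked unit
juju ssh openstack-loadbalancer/2 -- sudo systemctl status haproxy

# pacemaker's one-shot cluster status, including res_openstack_loadbalancer_haproxy
juju ssh openstack-loadbalancer/2 -- sudo crm_mon -1
```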
In the logs we see a timeout while trying to stop haproxy:
```
var/log/pacemaker/pacemaker.log-Nov 17 17:50:41.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_perform_op) info: + /cib/status/node_state[@id='1000']/lrm[@id='1000']/lrm_resources/lrm_resource[@id='res_openstack_loadbalancer_haproxy']/lrm_rsc_op[@id='res_openstack_loadbalancer_haproxy_last_0']: @operation_key=res_openstack_loadbalancer_haproxy_stop_0, @operation=stop, @crm-debug-origin=do_update_resource, @transition-key=2:9:0:fd391420-ca57-4da8-9021-1a51c847e3aa, @transition-magic=-1:193;2:9:0:fd391420-ca57-4da8-9021-1a51c847e3aa, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1668707441, @exec-time=0
var/log/pacemaker/pacemaker.log-Nov 17 17:50:41.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=juju-173067-2-lxd-10/crmd/74, version=0.12.22)
var/log/pacemaker/pacemaker.log-Nov 17 17:50:46.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_process_ping) info: Reporting our current digest to juju-173067-2-lxd-10: d4e7bf75ba1d472cf2644a30720ce797 for 0.12.22 (0x55f2216e0360 0)
var/log/pacemaker/pacemaker.log-Nov 17 17:50:57.888 juju-173067-2-lxd-10 pacemaker-controld [39825] (throttle_check_thresholds) notice: High CPU load detected: 20.889999
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (async_action_complete) warning: res_openstack_loadbalancer_haproxy_stop_0[43103] timed out after 20000ms
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (log_finished) info: res_openstack_loadbalancer_haproxy stop (call 21, PID 43103) could not be executed: Timed Out (Process did not exit within specified timeout) (execution time 20.001s)
var/log/pacemaker/pacemaker.log:Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-controld [39825] (log_executor_event) error: Result of stop operation for res_openstack_loadbalancer_haproxy on juju-173067-2-lxd-10: Timed Out after 20s (Process did not exit within specified timeout) | CIB update 75, graph action confirmed; call=21 key=res_openstack_loadbalancer_haproxy_stop_0
```
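The 20000ms in the execd message matches a 20s stop timeout on the resource itself; to double-check the configured operation timeouts, one could dump the primitive from the CIB on one of the loadbalancer units (a sketch, assuming the cluster is still reachable):
```
# Show the primitive, including its op start/stop/monitor timeouts (crmsh)
sudo crm configure show res_openstack_loadbalancer_haproxy

# Equivalent query straight from the CIB
sudo cibadmin --query --xpath "//primitive[@id='res_openstack_loadbalancer_haproxy']"
```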
I can't locate more hints about what is happening.
Configs and crashdumps are here:
https://oil-jenkins.canonical.com/artifacts/8b382b6f-73e8-455c-be26-d7b8d9b02143/index.html
This little snippet looks suspicious:
```
var/log/pacemaker/pacemaker.log-Nov 17 17:50:57.888 juju-173067-2-lxd-10 pacemaker-controld [39825] (throttle_check_thresholds) notice: High CPU load detected: 20.889999
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (async_action_complete) warning: res_openstack_loadbalancer_haproxy_stop_0[43103] timed out after 20000ms
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (log_finished) info: res_openstack_loadbalancer_haproxy stop (call 21, PID 43103) could not be executed: Timed Out (Process did not exit within specified timeout) (execution time 20.001s)
```
i.e. 2/lxd/10 had a load average of ~21 and the pacemaker stop operation timed out at 20 seconds.
The machine.lock for machine 2 at roughly the same time shows:
```
2022-11-17 17:50:45 unit-ceph-osd-2: ceph-osd/2 uniter (run update-status hook), waited 35s, held 4s
2022-11-17 17:50:56 unit-nrpe-0: nrpe/0 uniter (run start hook), waited 37s, held 12s
2022-11-17 17:51:05 unit-nrpe-4: nrpe/4 uniter (run start hook), waited 37s, held 9s
2022-11-17 17:51:05 unit-nova-compute-2: nova-compute/2 uniter (run relation-changed (35; unit: nova-compute/3) hook), waited 24s, held 0s
2022-11-17 17:51:07 unit-prometheus-libvirt-exporter-0: prometheus-libvirt-exporter/0 uniter (run update-status hook), waited 23s, held 2s
2022-11-17 17:51:18 unit-nrpe-0: nrpe/0 uniter (run relation-joined (401; unit: prometheus-grok-exporter/0) hook), waited 11s, held 10s
```
i.e. hooks are waiting 30+ seconds for a chance to run, which means another hook is holding the machine lock for that long.
2/lxd/10's machine.lock at the time shows that the hacluster-openstack-loadbalancer unit held the lock for 41s, which is consistent with the timeout:
```
2022-11-17 17:50:02 unit-hacluster-openstack-loadbalancer-0: hacluster-openstack-loadbalancer/0 uniter (run relation-changed (267; unit: openstack-loadbalancer/2) hook), waited 7s, held 41s
```
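As an aside, a quick-and-dirty way to pull the worst hold times out of the crashdumps is something like the following (the path is the usual /var/log/juju machine-lock.log location these entries come from, relative to the crashdump root; adjust as needed):
```
# List the longest machine-lock hold times recorded for a machine
grep -hoE 'held [0-9]+s' var/log/juju/machine-lock.log | sort -k2 -n | tail
```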
Machine 2's ps aux shows 750+ processes, and this is during deploy.
Machine 2 also has 11 LXD containers (so potentially 11 simultaneous juju hooks).
Machine 2 also has:
```
hardware: arch=amd64 cores=4 mem=65536M tags=foundation-nodes,rack1,regular,silo6 availability-zone=zone3
```
I'm going to hazard a guess that this was a resource contention issue, and the pacemaker stop operation simply timed out because it couldn't get sufficient resources.
A couple of things probably should be done here:
1. Extending the timeouts on the pacemaker operations (though that in itself causes problems); a sketch of what that might look like follows the list.
2. Collecting load average/CPU/IO stats for the machines so that load during these kinds of errors can be analysed; also sketched below.
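For (1), as a rough sketch of what extending the timeout would look like on a live cluster (the 60s value is purely illustrative, and the proper place for the change would be wherever the charm defines res_openstack_loadbalancer_haproxy):
```
# Raise the cluster-wide default operation timeout (crmsh); note that
# per-operation timeouts defined on the primitive still override this.
sudo crm configure op_defaults timeout=60s

# Or raise just the stop timeout on the haproxy resource, if pcs is in use:
sudo pcs resource update res_openstack_loadbalancer_haproxy op stop timeout=60s
```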
It's only a hunch that this is load related; there are some odd messages in the various logs, particularly about when a command is initiated vs when the timeout occurs, but I think we need to try to rule out timeouts due to resource constraints.
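For (2), even just enabling sysstat on the hosts (and including its data in the crashdumps) would let us correlate failures like this with load after the fact; a sketch for an Ubuntu host, where the default collection interval is 10 minutes:
```
# Enable periodic sar collection
sudo apt-get install -y sysstat
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl restart sysstat

# After a failure, pull figures around the window of interest, e.g.:
sar -q -s 17:45:00 -e 17:55:00   # run queue length / load averages
sar -u -s 17:45:00 -e 17:55:00   # CPU utilisation
sar -b -s 17:45:00 -e 17:55:00   # I/O and transfer rates
```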