03:48:47
-----------
Pacemaker reports that the stop operation timed out after 60 seconds while q-agent-cleanup.py was still running:
<27>Mar 25 03:48:47 node-14 crmd[6977]: error: process_lrm_event: LRM operation p_neutron-dhcp-agent_stop_0 (407) Timed Out (timeout=60000ms)
<29>Mar 25 03:48:47 node-14 crmd[6977]: notice: process_lrm_event: node-14.domain.tld-p_neutron-dhcp-agent_stop_0:407 [ 2015-03-25 03:47:50,827 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=dhcp
--cleanup-ports\n ]
This means the OCF script is still executing https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L639 and the function has not finished.
As a result, Pacemaker concludes that the stop operation hung and marks the resource as unmanaged. The same happens on all 3 controllers, leaving OpenStack without any DHCP agents.
q-agent-cleanup is slow when the number of namespaces is high. E.g. __collect_ports_for_namespace is called for every namespace. Even when the load is low, a call to "ip netns exec <namespace> ip l show" takes 0.1 seconds, which adds up to 200 seconds for 2k namespaces. That is more than 3 times the 60-second timeout in Pacemaker.
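A rough sketch of the arithmetic above, in case it helps when sizing the timeout. It uses only the figures quoted in this report (0.1 seconds per "ip netns exec" call, 2k namespaces, timeout=60000ms). The function names are hypothetical and are not part of q-agent-cleanup.py, and the model assumes exactly one call per namespace with no other overhead.

```python
# Back-of-the-envelope model of the cleanup duration; not code from q-agent-cleanup.py.
# Assumption: one "ip netns exec <namespace> ip l show" call per namespace,
# each taking a fixed time, and nothing else contributing to the run time.

PER_CALL_MS = 100            # 0.1 seconds per call, as measured under low load
PACEMAKER_TIMEOUT_MS = 60000  # timeout=60000ms from the crmd log line


def estimated_cleanup_ms(namespace_count, per_call_ms=PER_CALL_MS):
    """Total time spent in per-namespace calls, in milliseconds."""
    return namespace_count * per_call_ms


def max_namespaces_within_timeout(timeout_ms=PACEMAKER_TIMEOUT_MS,
                                  per_call_ms=PER_CALL_MS):
    """Largest namespace count whose cleanup still fits into the timeout."""
    return timeout_ms // per_call_ms


def stop_would_time_out(namespace_count, timeout_ms=PACEMAKER_TIMEOUT_MS,
                        per_call_ms=PER_CALL_MS):
    """True if the estimated cleanup alone exceeds the stop timeout."""
    return estimated_cleanup_ms(namespace_count, per_call_ms) > timeout_ms


if __name__ == "__main__":
    total_ms = estimated_cleanup_ms(2000)
    print("estimated cleanup: %d s" % (total_ms // 1000))
    print("ratio to timeout: %.2f" % (total_ms / PACEMAKER_TIMEOUT_MS))
    print("namespaces that fit into the timeout: %d" % max_namespaces_within_timeout())
```

Under these assumptions the stop operation can only finish in time with a few hundred namespaces, so with 2k namespaces the timeout is expected rather than accidental.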
Reconstruction of workflow sequence:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
03:47:47 - 03:47:50
-------------------
The OCF script /usr/lib/ocf/resource.d/fuel/ocf-neutron-dhcp-agent is running, function "neutron_dhcp_agent_stop".
During this time the function kills all processes (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L599) by sending SIGTERM. This is observed as a SIGTERM message in the dhcp-agent log. At least 2 processes are not killed in time, and the script tries to kill them again 3 seconds later. We see a message from kill that it has not found 2 pids.
03:47:50
------------
The OCF script reports that the agent is stopped (this is seen from the message "OpenStack DHCP Agent (neutron-dhcp-agent) stopped", https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L633).