Action monitor for ns_haproxy OCF takes too long under load

Bug #1424959 reported by Leontiy Istomin
Affects             Status         Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Committed  High        Bogdan Dobrelya
  5.0.x             Won't Fix      High        Bogdan Dobrelya
  5.1.x             Won't Fix      High        Bogdan Dobrelya
  6.0.x             Fix Committed  High        Bogdan Dobrelya
  6.1.x             Fix Committed  High        Bogdan Dobrelya

Bug Description

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: 1f87a9b9a47de7498b4061d15a8c7fb9435709d5
auth_required: true
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd4338964182ebfea1d539f963e66
fuelmain_sha: 4b6bcf60d712b74a3a53252bfbb5ca34aa7497c4
nailgun_sha: 1e3a40dd8a17abe1d38f42da1e0dc1a6d4572666
ostf_sha: f9c37d0876141e1550eb4e703a8e500cd463282f
production: docker
python-fuelclient_sha: 61431ed16fc00039a269424bdbaa410277eff609
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 1f87a9b9a47de7498b4061d15a8c7fb9435709d5
      build_id: 2015-02-13_16-48-42
      build_number: '120'
      feature_groups:
      - mirantis
      fuellib_sha: 7f8d4382abfcd4338964182ebfea1d539f963e66
      fuelmain_sha: 4b6bcf60d712b74a3a53252bfbb5ca34aa7497c4
      nailgun_sha: 1e3a40dd8a17abe1d38f42da1e0dc1a6d4572666
      ostf_sha: f9c37d0876141e1550eb4e703a8e500cd463282f
      production: docker
      python-fuelclient_sha: 61431ed16fc00039a269424bdbaa410277eff609
      release: '6.1'
Baremetal, CentOS, HA, Neutron-GRE, Ceph-all, Debug, nova-quotas, 6.1_120
Controllers: 3, Computes: 47

Deployment was successful.
I've added a Glance image as a workaround for bug https://bugs.launchpad.net/mos/+bug/1415954:
glance image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/opt/vm/cirros-x86_64-disk.img'
I've added the following line to /usr/bin/keystone-all as a workaround for bug https://bugs.launchpad.net/mos/+bug/1413341:
147 config.setup_logging()

I have faced the following exception during a Rally test:

2015-02-20 22:30:52.125 20140 ERROR rally.benchmark.runners.base [-] Failed to get the resource <Volume: 9abbe8cd-12ae-45a9-aa57-1b18bced545d>: Unable to establish connection: ('Connection aborted.', BadStatusLine("''",))
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base Traceback (most recent call last):
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 77, in _run_scenario_once
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base method_name)(**kwargs) or scenario_output
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/cinder/volumes.py", line 105, in create_and_delete_volume
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base self._delete_volume(volume)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 254, in func_atomic_actions
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base f = func(self, *args, **kwargs)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/cinder/utils.py", line 107, in _delete_volume
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base check_interval=CONF.benchmark.cinder_volume_delete_poll_interval
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/utils.py", line 130, in wait_for_delete
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base resource = update_resource(resource)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/utils.py", line 54, in _get_from_manager
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base raise exceptions.GetResourceFailure(resource=resource, err=e)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base GetResourceFailure: Failed to get the resource <Volume: 9abbe8cd-12ae-45a9-aa57-1b18bced545d>: Unable to establish connection: ('Connection aborted.', BadStatusLine("''",))

At the same time, I found the following in the messages log on the primary controller node:
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy Stats started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy horizon started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-metadata-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy cinder-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy neutron started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-registry started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy rabbitmq started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy mysqld started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy radosgw started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cfn started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cloudwatch started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-novncproxy started.

From the syslog log file:
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-1 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-44 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-49 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Feb 20 22:30:53 node-1 haproxy[13334]: proxy radosgw has no server available!

From the pacemaker log:
Feb 20 22:30:54 [23083] node-1.domain.tld lrmd: info: log_finished: finished - rsc:p_haproxy action:start call_id:241 pid:13266 exit-code:0 exec-time:1963ms queue-time:0ms

[root@node-1 ~]# crm status
Last updated: Tue Feb 24 07:49:38 2015
Last change: Thu Feb 19 23:39:25 2015
Stack: corosync
Current DC: node-1.domain.tld (1) - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured
27 Resources configured

Online: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]

 vip__public (ocf::fuel:ns_IPaddr2): Started node-1.domain.tld
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-44.domain.tld
 Clone Set: clone_p_openstack-heat-engine [p_openstack-heat-engine]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 p_neutron-dhcp-agent (ocf::fuel:ocf-neutron-dhcp-agent): Started node-49.domain.tld
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-1.domain.tld (unmanaged)
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-49.domain.tld (unmanaged)
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-44.domain.tld (unmanaged)
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.domain.tld ]
     Slaves: [ node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]

Failed actions:
    ping_vip__public_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=232, status=Timed Out, last-rc-change='Tue Feb 24 02:10:48 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-1.domain.tld 'unknown error' (1): call=192, status=Timed Out, last-rc-change='Fri Feb 20 18:10:43 2015', queued=0ms, exec=60000ms
    p_neutron-l3-agent_stop_0 on node-1.domain.tld 'unknown error' (1): call=192, status=Timed Out, last-rc-change='Fri Feb 20 18:10:43 2015', queued=0ms, exec=60000ms
    p_neutron-metadata-agent_monitor_60000 on node-1.domain.tld 'unknown error' (1): call=648, status=Timed Out, last-rc-change='Tue Feb 24 02:10:48 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=717, status=Timed Out, last-rc-change='Tue Feb 24 06:45:48 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=129, status=Timed Out, last-rc-change='Fri Feb 20 18:10:42 2015', queued=0ms, exec=0ms
    ping_vip__public_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=70, status=Timed Out, last-rc-change='Sun Feb 22 03:30:41 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=774, status=Timed Out, last-rc-change='Tue Feb 24 06:48:14 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=776, status=Timed Out, last-rc-change='Tue Feb 24 06:30:50 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-49.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 15:40:32 2015', queued=0ms, exec=60000ms
    p_neutron-l3-agent_stop_0 on node-49.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 15:40:32 2015', queued=0ms, exec=60000ms
    p_neutron-metadata-agent_monitor_60000 on node-49.domain.tld 'unknown error' (1): call=640, status=Timed Out, last-rc-change='Tue Feb 24 06:48:14 2015', queued=0ms, exec=0ms
    ping_vip__public_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=557, status=Timed Out, last-rc-change='Mon Feb 23 15:40:47 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=732, status=Timed Out, last-rc-change='Tue Feb 24 05:40:51 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=238, status=Timed Out, last-rc-change='Fri Feb 20 22:50:50 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-44.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 10:40:19 2015', queued=0ms, exec=60001ms
    p_neutron-l3-agent_stop_0 on node-44.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 10:40:19 2015', queued=0ms, exec=60001ms
    p_neutron-metadata-agent_monitor_60000 on node-44.domain.tld 'unknown error' (1): call=602, status=Timed Out, last-rc-change='Mon Feb 23 21:20:21 2015', queued=0ms, exec=0ms

Tags: scale
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs from the affected environment, the haproxy instances on node-44 and node-49 were not restarted, so the downtime was temporary and local to node-1. I believe this should be considered a medium impact.

The haproxy instance on node-1 was restarted due to a timeout of the monitor operation:
pacemaker.log:
Feb 20 22:30:50 [23083] node-1.domain.tld lrmd: warning: child_timeout_callback: p_haproxy_monitor_20000 process (PID 12917) timed out
Feb 20 22:30:51 [23083] node-1.domain.tld lrmd: warning: operation_finished: p_haproxy_monitor_20000:12917 - timed out after 10000ms
Feb 20 22:30:51 [23086] node-1.domain.tld crmd: error: process_lrm_event: Operation p_haproxy_monitor_20000: Timed Out (node=node-1.domain.tld, call=238, timeout=10000ms)
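
The timer that fired here corresponds to the monitor operation defined on the p_haproxy primitive (the name p_haproxy_monitor_20000 implies interval=20s, and the 10000ms above is its timeout). A quick way to confirm the configured values on a controller, assuming the crmsh shell already used elsewhere in this report:

[root@node-1 ~]# crm configure show p_haproxy
# look for the "op monitor interval=... timeout=..." entry in the primitive definition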

The root cause (RC) is not yet known, but it might be related to the critical level of I/O load:
atop 2015/02/20 22:30:33 - 2015/02/20 22:30:53:
LVM | os-root | busy 95% | read 1782 | write 432390 | KiB/r 35 | KiB/w 3 | MBr/s 3.10 | MBw/s 84.45 | avio 0.04 ms |
DSK | sda | busy 95% | read 1881 | write 3948 | KiB/r 34 | KiB/w 457 | MBr/s 3.16 | MBw/s 88.25 | avio 3.23 ms
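
(For reference, these figures can be replayed from the atop raw logs on the node; the path below is the default atop log location and is an assumption about this particular environment.)

# Replay the raw atop log around the incident: -r reads a raw file, -b sets the begin time.
[root@node-1 ~]# atop -r /var/log/atop/atop_20150220 -b 22:30
# then press 'd' interactively to show per-process disk activity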

Changed in mos:
milestone: none → 6.1
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Leontiy Istomin (listomin) wrote :

atop shows that the jbd2/dm-0-8 process uses 95% of IOPS. This occurred on each controller node, but corosync restarted resources on only one of them.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raised to High, as it looks like the haproxy RA is affected by performance issues, and this could impact cluster operations.

summary: - corosync restarts resources during rally tests
+ Action monitor for ns_haproxy OCF takes too long under load
Changed in mos:
importance: Medium → High
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here are the lrmd and crmd logs from node-1 in the attachment. There are multiple timeouts for monitor actions, starting from ./node-1.domain.tld/lrmd.log:2015-02-20T11:40:26.774728+00:00, and there are multiple "PID will not die" events as well.

Also, the atop logs show that haproxy monitor actions can sometimes take more than 20 seconds.
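
(To measure this outside of Pacemaker, the monitor action can also be timed by hand. This is a rough sketch: the agent path is inferred from the ocf::fuel: resource class, and the OCF_RESKEY_* parameters that lrmd normally passes are omitted, so the agent falls back to its defaults.)

# Time a single monitor call roughly the way lrmd would invoke it (defaults assumed):
[root@node-1 ~]# time OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/ns_haproxy monitor; echo rc=$?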

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in mos:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: Incomplete → Confirmed
Changed in mos:
assignee: Fuel Library Team (fuel-library) → svasheka (svasheka)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The solution is to increase the monitor interval and timeout, add logging to the haproxy OCF script, and optimize its monitor action, if possible.
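
A minimal sketch of the interval/timeout part in crmsh; the concrete values are illustrative assumptions, not the ones from the committed fix:

# Raise the monitor interval/timeout for p_haproxy, e.g. to interval=30s timeout=60s:
[root@node-1 ~]# crm configure edit p_haproxy
# ...adjust the "op monitor" entry accordingly (e.g. op monitor interval=30 timeout=60), then save and commit.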

Changed in mos:
status: Confirmed → Triaged
assignee: svasheka (svasheka) → Bogdan Dobrelya (bogdando)
Revision history for this message
Leontiy Istomin (listomin) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"PID will not die" issue can be related to https://bugs.launchpad.net/fuel/+bug/1425579

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in mos:
status: Triaged → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 5.0-updates and 5.1.1-updates: this is not a customer-found issue, and it's too risky to backport as the change is large (77 LOC for the OCF script).
