Action monitor for ns_haproxy OCF takes too long under load
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Fix Committed | High | Bogdan Dobrelya |
5.0.x | Won't Fix | High | Bogdan Dobrelya |
5.1.x | Won't Fix | High | Bogdan Dobrelya |
6.0.x | Fix Committed | High | Bogdan Dobrelya |
6.1.x | Fix Committed | High | Bogdan Dobrelya |
Bug Description
[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: 1f87a9b9a47de74
auth_required: true
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd43
fuelmain_sha: 4b6bcf60d712b74
nailgun_sha: 1e3a40dd8a17abe
ostf_sha: f9c37d0876141e1
production: docker
python-
release: '6.1'
release_versions:
2014.2-6.1:
VERSION:
api: '1.0'
astute_sha: 1f87a9b9a47de74
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd43
fuelmain_sha: 4b6bcf60d712b74
nailgun_sha: 1e3a40dd8a17abe
ostf_sha: f9c37d0876141e1
production: docker
python-
release: '6.1'
Baremetal, CentOS, HA, Neutron-
Controllers: 3, Computes: 47
Deployment was successful.
I've added a Glance image due to the bug https:/
glance image-create --name 'TestVM' --is-public true --container-
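(The command above is truncated in this report; a complete form would look roughly like the sketch below. The container format, disk format, and image file name are assumptions based on the usual CirrOS TestVM image, not the exact command that was run.)
glance image-create --name 'TestVM' --is-public true \
    --container-format bare --disk-format qcow2 \
    --file cirros-0.3.2-x86_64-disk.img   # image file name is an assumption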
I've added the following line to /usr/bin/
147 config.
I hit the following exception during a Rally test:
2015-02-20 22:30:52.125 20140 ERROR rally.benchmark
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark
At the same time I have found in messages log on primary controller node:
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy Stats started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy horizon started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-metadata-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy cinder-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy neutron started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-registry started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy rabbitmq started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy mysqld started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy radosgw started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cfn started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cloudwatch started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-novncproxy started.
from syslog log file:
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-1 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-44 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-49 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Feb 20 22:30:53 node-1 haproxy[13334]: proxy radosgw has no server available!
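(Backend state like this can also be confirmed at runtime via the haproxy stats socket; a minimal sketch, assuming Fuel's default socket path /var/lib/haproxy/stats, which may differ on this environment:)
echo "show stat" | socat stdio /var/lib/haproxy/stats | \
    awk -F, '$1 == "radosgw" {print $2, $18}'   # field 18 of "show stat" CSV is the per-server status (UP/DOWN)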
from pacemaker log:
Feb 20 22:30:54 [23083] node-1.domain.tld lrmd: info: log_finished: finished - rsc:p_haproxy action:start call_id:241 pid:13266 exit-code:0 exec-time:1963ms queue-time:0ms
[root@node-1 ~]# crm status
Last updated: Tue Feb 24 07:49:38 2015
Last change: Thu Feb 19 23:39:25 2015
Stack: corosync
Current DC: node-1.domain.tld (1) - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured
27 Resources configured
Online: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
vip__public (ocf::fuel:
Clone Set: clone_ping_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
vip__management (ocf::fuel:
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
p_neutron-
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_
p_
p_
p_
Clone Set: clone_p_mysql [p_mysql]
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Master/Slave Set: master_
Masters: [ node-1.domain.tld ]
Slaves: [ node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_haproxy [p_haproxy]
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Failed actions:
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
Changed in mos:
assignee: Fuel Library Team (fuel-library) → svasheka (svasheka)
According to the logs from the affected env, the haproxy instances at node-44 and node-49 were not restarted, so the downtime was temporary and local to node-1. I believe this should be considered a medium impact.
The haproxy instance at node-1 was restarted because its monitor operation timed out: p_haproxy_monitor_20000 process (PID 12917) timed out after 10000ms (node=node-1.domain.tld, call=238, timeout=10000ms).
pacemaker.log:
Feb 20 22:30:50 [23083] node-1.domain.tld lrmd: warning: child_timeout_callback: p_haproxy_monitor_20000 process (PID 12917) timed out
Feb 20 22:30:51 [23083] node-1.domain.tld lrmd: warning: operation_finished: p_haproxy_monitor_20000:12917 - timed out after 10000ms
Feb 20 22:30:51 [23086] node-1.domain.tld crmd: error: process_lrm_event: Operation p_haproxy_monitor_20000: Timed Out (node=node-1.domain.tld, call=238, timeout=10000ms)
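If the 10000ms monitor timeout simply leaves too little headroom under load, one possible mitigation is to give the monitor operation more time. This is only a sketch with crmsh, not the committed fix; the interval/timeout values below are illustrative assumptions:
crm configure show p_haproxy      # inspect the current op monitor interval/timeout
crm configure edit p_haproxy      # e.g. change to: op monitor interval=20 timeout=60
crm_verify -L                     # sanity-check the live CIB after the edit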
The root cause is not yet known, but it might be related to the critical level of IO load:
atop 2015/02/20 22:30:33 - 2015/02/20 22:30:53:
LVM | os-root | busy 95% | read 1782 | write 432390 | KiB/r 35 | KiB/w 3 | MBr/s 3.10 | MBw/s 84.45 | avio 0.04 ms |
DSK | sda | busy 95% | read 1881 | write 3948 | KiB/r 34 | KiB/w 457 | MBr/s 3.16 | MBw/s 88.25 | avio 3.23 ms
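To correlate the monitor timeout with that disk saturation window, the recorded atop history can be replayed around the incident; a minimal sketch, assuming the default raw-log location on this node:
atop -r /var/log/atop/atop_20150220 -b 22:30 -e 22:31   # replay the 22:30-22:31 window of Feb 20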