Action monitor for ns_haproxy OCF takes too long under load
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Fix Committed | High | Bogdan Dobrelya |
5.0.x | Won't Fix | High | Bogdan Dobrelya |
5.1.x | Won't Fix | High | Bogdan Dobrelya |
6.0.x | Fix Committed | High | Bogdan Dobrelya |
6.1.x | Fix Committed | High | Bogdan Dobrelya |
Bug Description
[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: 1f87a9b9a47de74
auth_required: true
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd43
fuelmain_sha: 4b6bcf60d712b74
nailgun_sha: 1e3a40dd8a17abe
ostf_sha: f9c37d0876141e1
production: docker
python-
release: '6.1'
release_versions:
2014.2-6.1:
VERSION:
api: '1.0'
astute_sha: 1f87a9b9a47de74
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd43
fuelmain_sha: 4b6bcf60d712b74
nailgun_sha: 1e3a40dd8a17abe
ostf_sha: f9c37d0876141e1
production: docker
python-
release: '6.1'
Baremetal, CentOS, HA, Neutron-
Controllers: 3, Computes: 47
Deployment was successful.
I've added a Glance image due to the bug https:/
glance image-create --name 'TestVM' --is-public true --container-
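(The command above is truncated in this report; a complete form would look roughly like the sketch below. The container format, disk format, and image file name are assumptions based on the usual CirrOS TestVM image, not the exact command that was run.)
glance image-create --name 'TestVM' --is-public true \
    --container-format bare --disk-format qcow2 \
    --file cirros-0.3.2-x86_64-disk.img   # image file name is an assumption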
I've added the following line to /usr/bin/
147 config.
I hit the following exception during a Rally test:
2015-02-20 22:30:52.125 20140 ERROR rally.benchmark
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark
At the same time I have found in messages log on primary controller node:
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy Stats started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy horizon started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-metadata-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy cinder-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy neutron started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-registry started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy rabbitmq started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy mysqld started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy radosgw started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cfn started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cloudwatch started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-novncproxy started.
from syslog log file:
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-1 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-44 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-49 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Feb 20 22:30:53 node-1 haproxy[13334]: proxy radosgw has no server available!
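(Backend state like this can also be confirmed at runtime via the haproxy stats socket; a minimal sketch, assuming Fuel's default socket path /var/lib/haproxy/stats, which may differ on this environment:)
echo "show stat" | socat stdio /var/lib/haproxy/stats | \
    awk -F, '$1 == "radosgw" {print $2, $18}'   # field 18 of "show stat" CSV is the per-server status (UP/DOWN)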
from pacemaker log:
Feb 20 22:30:54 [23083] node-1.domain.tld lrmd: info: log_finished: finished - rsc:p_haproxy action:start call_id:241 pid:13266 exit-code:0 exec-time:1963ms queue-time:0ms
[root@node-1 ~]# crm status
Last updated: Tue Feb 24 07:49:38 2015
Last change: Thu Feb 19 23:39:25 2015
Stack: corosync
Current DC: node-1.domain.tld (1) - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured
27 Resources configured
Online: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
vip__public (ocf::fuel:
Clone Set: clone_ping_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
vip__management (ocf::fuel:
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
p_neutron-
Clone Set: clone_p_
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_
p_
p_
p_
Clone Set: clone_p_mysql [p_mysql]
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Master/Slave Set: master_
Masters: [ node-1.domain.tld ]
Slaves: [ node-44.domain.tld node-49.domain.tld ]
Clone Set: clone_p_haproxy [p_haproxy]
Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
Failed actions:
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
ping_
p_neutron-
p_neutron-
p_neutron-
p_neutron-
p_neutron-
Changed in mos:
assignee: Fuel Library Team (fuel-library) → svasheka (svasheka)
According to the logs from the affected env, the haproxy instances at node-44 and node-49 were not restarted, so the downtime was temporary and local to node-1. I believe this should be considered a medium impact.
The haproxy instance at node-1 was restarted because its monitor operation timed out: p_haproxy_monitor_20000 process (PID 12917) timed out after 10000ms (node=node-1.domain.tld, call=238, timeout=10000ms).
pacemaker.log:
Feb 20 22:30:50 [23083] node-1.domain.tld lrmd: warning: child_timeout_callback: p_haproxy_monitor_20000 process (PID 12917) timed out
Feb 20 22:30:51 [23083] node-1.domain.tld lrmd: warning: operation_finished: p_haproxy_monitor_20000:12917 - timed out after 10000ms
Feb 20 22:30:51 [23086] node-1.domain.tld crmd: error: process_lrm_event: Operation p_haproxy_monitor_20000: Timed Out (node=node-1.domain.tld, call=238, timeout=10000ms)
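If the 10000ms monitor timeout simply leaves too little headroom under load, one possible mitigation is to give the monitor operation more time. This is only a sketch with crmsh, not the committed fix; the interval/timeout values below are illustrative assumptions:
crm configure show p_haproxy      # inspect the current op monitor interval/timeout
crm configure edit p_haproxy      # e.g. change to: op monitor interval=20 timeout=60
crm_verify -L                     # sanity-check the live CIB after the edit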
The root cause is not yet known, but it might be related to the critical level of IO load:
atop 2015/02/20 22:30:33 - 2015/02/20 22:30:53:
LVM | os-root | busy 95% | read 1782 | write 432390 | KiB/r 35 | KiB/w 3 | MBr/s 3.10 | MBw/s 84.45 | avio 0.04 ms |
DSK | sda | busy 95% | read 1881 | write 3948 | KiB/r 34 | KiB/w 457 | MBr/s 3.16 | MBw/s 88.25 | avio 3.23 ms
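To correlate the monitor timeout with that disk saturation window, the recorded atop history can be replayed around the incident; a minimal sketch, assuming the default raw-log location on this node:
atop -r /var/log/atop/atop_20150220 -b 22:30 -e 22:31   # replay the 22:30-22:31 window of Feb 20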