Comment 11 for bug 1533197

Roman Podoliaka (rpodolyaka) wrote :

So I took a look at the environment and uploaded a diagnostic snapshot.

The errors in the nova-compute logs occur because haproxy terminates the HTTP requests to Cinder (volume connection initialization) after a 60s timeout:

http://paste.openstack.org/show/485411/
http://paste.openstack.org/show/485417/

I checked atop logs to figure out if we are CPU bound or not (cinder-api runs in active-backup mode):

http://paste.openstack.org/show/485413/

CPU usage is rather high, but it should still be OK for a 12-core server.

cinder-volume / cinder-api interaction is interesting:

http://paste.openstack.org/show/485428/

The HTTP POST to cinder-api is processed *synchronously*, as the API waits for the RPC call to cinder-volume to return. For some reason cinder-volume is slow to process the RPC call (it looks like it gets stuck somewhere in Ceph calls), and by the time it is ready to send an RPC reply, rabbitmq has closed the connection (heartbeats missed?).

Eventually, cinder-volume retries and delivers the reply to cinder-api, but by then it is too late: haproxy has already closed the connection on its side.
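
To make the timing chain above easier to reason about, here is a rough toy model in Python. It is only a sketch, not code from Cinder or haproxy: the function names (cinder_api_post, via_proxy, simulate), the returned strings, and the time scaling are all made up for illustration. The only number taken from this report is the 60s proxy timeout.

```python
import asyncio

# Assumption for the demo only: 1 simulated second = 5 ms of real time.
SCALE = 0.005


async def cinder_api_post(rpc_seconds):
    # Hypothetical stand-in for cinder-api: the handler does not answer
    # the HTTP request until the RPC call to cinder-volume has returned.
    await asyncio.sleep(rpc_seconds * SCALE)
    return "200 OK"


async def via_proxy(rpc_seconds, proxy_timeout=60):
    # Hypothetical stand-in for haproxy: stop waiting for the backend
    # after proxy_timeout simulated seconds and drop the request.
    try:
        return await asyncio.wait_for(
            cinder_api_post(rpc_seconds), proxy_timeout * SCALE
        )
    except asyncio.TimeoutError:
        # The backend may still finish later, but the client never sees it.
        return "proxy timed out"


def simulate(rpc_seconds, proxy_timeout=60):
    return asyncio.run(via_proxy(rpc_seconds, proxy_timeout))


if __name__ == "__main__":
    print(simulate(5))    # RPC returns well within the proxy timeout
    print(simulate(90))   # RPC reply arrives only after the proxy gave up
```

The point of the model: as long as the API call is synchronous, any RPC round trip (including retries after a dropped rabbitmq connection) that takes longer than the proxy timeout looks like a failure to nova-compute, even if cinder-volume eventually completes the work.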