So I took a look at the environment and uploaded a diagnostic snapshot.
Errors in the nova-compute logs are caused by haproxy terminating the HTTP requests to Cinder (volume connection initialization) after its 60s timeout:
http://paste.openstack.org/show/485411/
http://paste.openstack.org/show/485417/
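From nova-compute's side the failure would look roughly like this (a minimal sketch; the endpoint, token and IDs below are placeholders, not values from this environment):

import requests

# Placeholder identifiers, for illustration only.
TENANT = 'a1b2c3'
VOLUME = 'd4e5f6'
TOKEN = 'gAAAA...'
URL = 'http://192.168.0.2:8776/v2/%s/volumes/%s/action' % (TENANT, VOLUME)

# os-initialize_connection is the volume action nova-compute invokes when
# attaching a volume; cinder-api cannot answer it until cinder-volume replies.
body = {'os-initialize_connection':
        {'connector': {'ip': '10.0.0.5',
                       'initiator': 'iqn.1993-08.org.debian:01:abc',
                       'host': 'compute-1'}}}

resp = requests.post(URL, json=body, headers={'X-Auth-Token': TOKEN})
# With cinder-volume stuck, haproxy gives up at its 60s timeout and answers
# for the backend (a 504) or simply drops the connection.
print(resp.status_code)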
I checked the atop logs to figure out whether we are CPU-bound (cinder-api runs in active-backup mode):
http://paste.openstack.org/show/485413/
CPU usage is rather high, but it should still be OK for a 12-core server.
The cinder-volume / cinder-api interaction is interesting:
http://paste.openstack.org/show/485428/
The HTTP POST to cinder-api is processed *synchronously*: the API waits for an RPC call to cinder-volume to return. The latter is for some reason slow to process the RPC call - it looks like it gets stuck somewhere in Ceph calls - and by the time it is ready to send the RPC reply, rabbitmq has already closed the connection (missed heartbeats?).
Eventually, cinder-volume retries and delivers the reply to cinder-api, but by then it is too late: haproxy has already closed the connection on its side.
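For context, this is roughly the blocking call pattern inside cinder-api (a simplified oslo.messaging sketch; the topic, method arguments and the 60s value mirror the library defaults, not anything verified in this snapshot):

import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_transport(cfg.CONF)
target = messaging.Target(topic='cinder-volume', version='1.0')
# rpc_response_timeout defaults to 60s - the same order as haproxy's timeout.
client = messaging.RPCClient(transport, target, timeout=60)

# call() parks the API worker until cinder-volume replies over rabbitmq.
# If cinder-volume spends longer than that inside Ceph, two timers lose:
# rabbitmq drops the idle connection (missed heartbeats), so the reply has
# to be redelivered, and haproxy gives up on the still-open HTTP request.
conn_info = client.call({}, 'initialize_connection',
                        volume_id='d4e5f6',
                        connector={'host': 'compute-1'})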