rpc core should abort a call() early if the connection is terminated before the timeout period expires

Bug #1368917 reported by Chris Friesen
This bug affects 1 person
Affects                    Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid    Undecided   Unassigned
oslo.messaging             Confirmed  Medium      Ken Giusti

Bug Description

As it stands, if a client issuing an RPC call() sends a message to the rabbitmq server and the server then does a switchover/failover, the client will wait for the full RPC timeout period (60 seconds), even though the new rabbitmq server has come up long before then and some connections have already been reestablished.

Especially on a controlled switchover, the RPC core should notice that the server has gone away and notify any entities waiting for an RPC call() response, so that they can error out early rather than waiting for the full RPC timeout period.

This was detected on Havana, but it seems to apply to all other versions as well.

Tags: oslo
Sean Dague (sdague) wrote :

I think this really is an oslo.messaging bug; these things are mostly left up to that lib at this point.

Changed in nova:
status: New → Invalid
Doug Hellmann (doug-hellmann) wrote :

Does the client not reconnect, or does the reconnection not fix the issue somehow?

Changed in oslo.messaging:
status: New → Incomplete
Chris Friesen (cbf123) wrote :

The client waits for the full RPC timeout period (60 secs by default) and then reconnects. However, in an HA environment the newly-active rabbitmq server will be up and available long before then.

Also, on a controlled switchover the rabbitmq server will have done a formal shutdown of the connection which would notify the client that the connection is being shut down. Something on the client should notice that the server has gone away long before the RPC timeout is over, and alert anyone waiting for an RPC response that they're never going to get one.
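To illustrate the behaviour being asked for, here is a minimal sketch in plain Python (standard library only, not oslo.messaging code). The names `PendingCalls`, `ConnectionLost`, `CallTimeout` and `connection_closed()` are hypothetical. The assumption is that the transport layer has some hook that fires when the broker closes the connection, and that this hook wakes every waiter instead of leaving it to run out the RPC timeout.

```python
import threading
import time


class ConnectionLost(Exception):
    """Hypothetical error: the broker connection closed while a call was pending."""


class CallTimeout(Exception):
    """Hypothetical error: no reply arrived within the RPC timeout."""


class PendingCalls:
    """Tracks outstanding call()s so they can be failed early on disconnect."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # msg_id -> {"done": Event, "reply": ..., "error": ...}

    def register(self, msg_id):
        with self._lock:
            self._calls[msg_id] = {
                "done": threading.Event(), "reply": None, "error": None}

    def deliver(self, msg_id, reply):
        # Called by the (assumed) reader thread when a reply arrives.
        with self._lock:
            call = self._calls.get(msg_id)
            if call is not None and not call["done"].is_set():
                call["reply"] = reply
                call["done"].set()

    def connection_closed(self, reason):
        # Called by the (assumed) transport hook when the connection goes away.
        # Every call still waiting is woken up with an error.
        with self._lock:
            for call in self._calls.values():
                if not call["done"].is_set():
                    call["error"] = ConnectionLost(reason)
                    call["done"].set()

    def wait(self, msg_id, timeout):
        with self._lock:
            call = self._calls[msg_id]
        try:
            if not call["done"].wait(timeout):
                raise CallTimeout(msg_id)
            if call["error"] is not None:
                raise call["error"]
            return call["reply"]
        finally:
            with self._lock:
                self._calls.pop(msg_id, None)
```

With something along these lines, a waiter blocked in `wait(msg_id, timeout=60)` would return with `ConnectionLost` as soon as `connection_closed()` is invoked, and could then retry against the newly active server.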

description: updated
Chris Friesen (cbf123)
Changed in oslo.messaging:
status: Incomplete → New
Changed in oslo.messaging:
status: New → Confirmed
importance: Undecided → Low
QingchuanHao (haoqingchuan-28) wrote :

A client can reconnect to another rabbitmq-server if the one it is connected to goes down (provided multiple rabbitmq-servers are configured via rabbit_hosts, or sit behind a proxy such as haproxy), and the rabbitmq-servers in a cluster keep their state (messages, etc.) synchronized.
Did I miss something? @Sean Dague

Ken Giusti (kgiusti) wrote :

There is a new feature in oslo.messaging that will detect the loss of the message bus and immediately raise a message delivery failure.

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/releasenotes/notes/RPC-call-monitoring-7977f047d069769a.yaml

This new feature should address this issue - moving bug to 'in-progress'
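For readers unfamiliar with the idea, call monitoring can be pictured roughly as follows. This is a standalone sketch using only the standard library, not the oslo.messaging implementation. It assumes the server side periodically sends a heartbeat message while it works on the call; `wait_for_reply`, `ServerSilent`, `CallTimeout` and the `("heartbeat" | "reply", payload)` message shape are all hypothetical names chosen for the illustration.

```python
import queue
import time


class CallTimeout(Exception):
    """Hypothetical error: the overall RPC timeout expired."""


class ServerSilent(Exception):
    """Hypothetical error: no heartbeat or reply within the monitor window."""


def wait_for_reply(incoming, timeout, monitor_timeout):
    """Wait on `incoming` (a queue.Queue of (kind, payload) tuples).

    `timeout` bounds the whole call; `monitor_timeout` bounds the silence
    allowed between consecutive messages from the server.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise CallTimeout()
        try:
            kind, payload = incoming.get(timeout=min(remaining, monitor_timeout))
        except queue.Empty:
            # Decide which limit was hit: the overall deadline or the
            # shorter heartbeat window.
            if remaining <= monitor_timeout:
                raise CallTimeout()
            raise ServerSilent()
        if kind == "reply":
            return payload
        # Any other message is treated as a heartbeat; looping again
        # starts a fresh monitor window.
```

The point of the second, shorter window is that a caller with a 60 second timeout can give up after a few seconds of silence. As noted in the following comment, this narrows the window but does not by itself react to the socket being closed.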

Changed in oslo.messaging:
status: Confirmed → In Progress
assignee: nobody → Ken Giusti (kgiusti)
Chris Friesen (cbf123) wrote :

This will probably reduce the time window, but it seems weird that the client is essentially ignoring the socket shutdown during a controlled failover.

Ken Giusti (kgiusti) wrote :

Fair point, and a good question.
Let's keep this open as a separate bug; we need to determine how the pyamqp library surfaces the socket close event.

Changed in oslo.messaging:
status: In Progress → Confirmed
importance: Low → Medium