oslo.messaging

rpc core should abort a call() early if the connection is terminated before the timeout period expires

Bug #1368917 reported by Chris Friesen on 2014-09-12

This bug affects 1 person

Affects                    Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid    Undecided   Unassigned
oslo.messaging             Confirmed  Medium      Ken Giusti

Bug Description

As it stands, if a client issuing an RPC call() sends a message to the rabbitmq server and the server then does a switchover/failover, the client will wait for the full RPC timeout period (60 seconds), even though the new rabbitmq server has come up long before then and some connections have been reestablished.

On a controlled switchover especially, the RPC core should notice that the server has gone away and should notify any entities waiting for an RPC call() response so that they can error out early rather than waiting for the full RPC timeout period.

This was detected on Havana, but it seems to apply to all other versions as well.

Sean Dague (sdague) wrote on 2014-09-12:

I think this really is an oslo.messaging bug; these things are mostly left up to that lib at this point.

Changed in nova:
status:	New → Invalid

Doug Hellmann (doug-hellmann) wrote on 2014-09-15:

Does the client not reconnect, or does the reconnection not fix the issue somehow?

Changed in oslo.messaging:
status:	New → Incomplete

Chris Friesen (cbf123) wrote on 2014-09-15:

The client waits for the full RPC timeout period (60 secs by default) and then reconnects. However, in an HA environment the newly-active rabbitmq server will be up and available long before then.

Also, on a controlled switchover the rabbitmq server will have done a formal shutdown of the connection which would notify the client that the connection is being shut down. Something on the client should notice that the server has gone away long before the RPC timeout is over, and alert anyone waiting for an RPC response that they're never going to get one.
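
To make the request concrete, here is a minimal sketch of a client-side registry that fails every outstanding call as soon as the connection is reported closed, instead of leaving the callers to run out their timeout. It uses only the Python standard library and is not oslo.messaging code; `ReplyWaiters`, `PendingCall`, `ConnectionLost` and the `on_connection_lost` hook are hypothetical names, and it assumes the transport layer can invoke such a hook when it sees the broker shut the connection down.

```python
import threading


class ConnectionLost(Exception):
    """Hypothetical error handed to callers whose connection dropped."""


class PendingCall:
    """One outstanding call() that is waiting for a reply."""

    def __init__(self):
        self._done = threading.Event()
        self._reply = None
        self._error = None

    def set_reply(self, reply):
        self._reply = reply
        self._done.set()

    def set_error(self, error):
        self._error = error
        self._done.set()

    def wait(self, timeout):
        # Returns the reply, or raises as soon as an error is recorded.
        if not self._done.wait(timeout):
            raise TimeoutError("no reply within %.1f seconds" % timeout)
        if self._error is not None:
            raise self._error
        return self._reply


class ReplyWaiters:
    """Tracks pending calls by message id (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def register(self, msg_id):
        call = PendingCall()
        with self._lock:
            self._pending[msg_id] = call
        return call

    def on_reply(self, msg_id, reply):
        with self._lock:
            call = self._pending.pop(msg_id, None)
        if call is not None:
            call.set_reply(reply)

    def on_connection_lost(self, reason):
        # Assumed to be called by the transport when the socket is closed.
        with self._lock:
            pending, self._pending = self._pending, {}
        for call in pending.values():
            call.set_error(ConnectionLost(reason))
```

With this shape, a caller blocked in `wait(timeout=60)` gets `ConnectionLost` the moment the hook fires and can retry or fail over right away; the timeout remains only as a backstop.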

description:	updated

Chris Friesen (cbf123) on 2015-02-12

Changed in oslo.messaging:
status:	Incomplete → New

Doug Hellmann (doug-hellmann) on 2015-03-02

Changed in oslo.messaging:
status:	New → Confirmed
importance:	Undecided → Low

QingchuanHao (haoqingchuan-28) wrote on 2015-12-19:

The client can reconnect to another rabbitmq-server if the one it is connected to goes down (provided multiple rabbitmq-servers are configured via rabbit_hosts or behind a proxy such as haproxy), and the rabbitmq-servers in a cluster keep their state, messages, etc. synchronized.
Did I miss something? @Sean Dague

Ken Giusti (kgiusti) wrote on 2018-07-05:

There is a new feature in oslo.messaging that will detect the loss of the message bus and immediately raise a message delivery failure.

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/releasenotes/notes/RPC-call-monitoring-7977f047d069769a.yaml

This new feature should address this issue - moving bug to 'in-progress'
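
As a rough illustration of the monitoring idea (not the oslo.messaging implementation), the sketch below waits for a reply but gives up early if nothing at all, not even a keepalive, arrives within a shorter monitoring window. The names `wait_for_reply`, `CallMonitorTimeout`, `call_timeout` and `monitor_timeout`, and the `("heartbeat", None)` / `("reply", value)` message shapes, are assumptions made for this example.

```python
import queue
import time


class CallMonitorTimeout(Exception):
    """Hypothetical error: the peer went silent before the call finished."""


def wait_for_reply(inbox, call_timeout, monitor_timeout):
    """Wait on `inbox` for ("reply", value), tolerating ("heartbeat", None).

    Raises CallMonitorTimeout if nothing arrives for `monitor_timeout`
    seconds, and TimeoutError once `call_timeout` seconds have passed.
    """
    deadline = time.monotonic() + call_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("call timed out")
        try:
            # Never wait longer than the overall deadline allows.
            kind, value = inbox.get(timeout=min(monitor_timeout, remaining))
        except queue.Empty:
            if time.monotonic() >= deadline:
                raise TimeoutError("call timed out")
            raise CallMonitorTimeout("no heartbeat or reply received")
        if kind == "reply":
            return value
        # A heartbeat proves the peer is alive; loop to start a new window.
```

A caller using a 60 second `call_timeout` with a much shorter `monitor_timeout` would learn of a dead peer within the shorter window. As noted in the follow-up comments, this narrows the window but is separate from reacting to the socket close itself.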

Changed in oslo.messaging:
status:	Confirmed → In Progress
assignee:	nobody → Ken Giusti (kgiusti)

Chris Friesen (cbf123) wrote on 2018-07-09:

This will probably reduce the time window, but it seems weird that the client is essentially ignoring the socket shutdown during a controlled failover.

Ken Giusti (kgiusti) wrote on 2018-07-09:

Fair point, and a good question.
Let's keep this open as a separate bug - need to determine how the pyamqp library surfaces the socket close event.

Changed in oslo.messaging:
status:	In Progress → Confirmed
importance:	Low → Medium
