Oslo amqpdriver doesn't always honor reply timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.messaging | Fix Released | Undecided | Joshua Harlow |
Bug Description
When several callers wait for replies on the same reply queue, the Oslo amqpdriver has the first thread to arrive sit in the _poll_connection method in the amqpdriver.py module, calling down to the underlying messaging driver with the timeout that caller specified. All other callers waiting for replies on that reply queue simply block until their own reply is delivered.
When a reply arrives, the ReplyWaiter's _poll_connection method checks whether it is the message this caller is waiting for. If it is, the method returns; otherwise it notifies the other waiting callers of the message and goes back to waiting with the original timeout. It therefore times out only if no message at all arrives on the reply queue for a full timeout period. As a result, the original timeout is honored only if no other messages arrive on that reply queue during the wait; otherwise the caller waits at least a little longer than specified and, at worst, hangs forever.
This happens, for example, in nova-compute, where one thread sends service updates to the conductor and periodic tasks also send updates to the conductor. Suppose the reply to one of those threads (say, the service update thread) never arrives (the services or message broker get restarted in the middle, some other action hangs, a network issue, etc.) or takes a really long time. That thread is then stuck in the _poll_connection method, while the periodic task thread comes in, makes calls to the conductor, and receives its replies. If the interval between those calls is shorter than the reply timeout, the service update thread will be stuck in _poll_connection forever.
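To make the failure mode concrete, here is a small simulation of the behaviour described above. It is only a sketch, not the oslo.messaging code: FakeConnection, consume, naive_wait and give_up_after are hypothetical names, and a simulated clock stands in for real waiting so the example runs instantly.

```python
class FakeConnection:
    """Hypothetical stand-in for a broker connection with a simulated clock."""

    def __init__(self, arrival_interval):
        self.now = 0.0
        # Replies meant for *other* callers arrive this often (in seconds).
        self.arrival_interval = arrival_interval
        self._next_arrival = arrival_interval

    def consume(self, timeout):
        """Wait up to `timeout` simulated seconds; True if a message arrived."""
        wait_until = self.now + timeout
        if self._next_arrival <= wait_until:
            self.now = self._next_arrival
            self._next_arrival += self.arrival_interval
            return True
        self.now = wait_until
        return False


def naive_wait(conn, timeout, give_up_after):
    """Wait for a reply that never comes, restarting the full timeout each poll.

    Returns the simulated time at which the wait timed out, or None if the
    caller was still waiting when the simulation stopped at give_up_after.
    """
    while conn.now < give_up_after:
        if not conn.consume(timeout):
            return conn.now  # nothing arrived for a full timeout period
        # A message arrived, but it belongs to another caller: wait again
        # with the *original* timeout, which is the behaviour reported here.
    return None


# Other replies every 10s, timeout 60s: still waiting after an hour.
print(naive_wait(FakeConnection(arrival_interval=10), timeout=60, give_up_after=3600))
# Other replies every 120s, timeout 60s: the timeout fires as expected.
print(naive_wait(FakeConnection(arrival_interval=120), timeout=60, give_up_after=3600))
```

With unrelated replies arriving more often than the timeout, the simulated caller never times out; once they arrive less often than the timeout, it does.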
To fix this issue, I think _poll_connection needs a fix similar to what the AMQPListener's "poll" method in the same module already does: keep track of how long it has waited in the method and, on each iteration, wait only for the time remaining, so that the original timeout is honored. This isn't ideal when the system time changes (forward or backward), but there already seems to be precedent for this in the poll method.
Here is the code snippet for the change that seems to work for me:
def _poll_connectio
if timeout is not None and timeout > 0: <===== Line Added
else: <===== Line Added
while True:
while self.incoming:
if incoming_msg_id == msg_id:
try:
if deadline is not None: <===== Line Added
except rpc_common.Timeout:
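The snippet above is abbreviated, so as a rough, self-contained illustration of the same deadline idea (not the actual oslo.messaging patch), the sketch below computes a deadline once and passes only the remaining time to each wait. The names wait_for_reply and ReplyTimeout are hypothetical, and a standard-library queue.Queue of (msg_id, payload) tuples stands in for the reply queue. It also uses time.monotonic, which is my substitution and sidesteps the system-time-change concern mentioned in the description.

```python
import queue
import time


class ReplyTimeout(Exception):
    """Hypothetical exception, raised when the overall timeout expires."""


def wait_for_reply(replies, msg_id, timeout, clock=time.monotonic):
    """Return the payload for msg_id, honoring `timeout` across all polls.

    `replies` is assumed to be a queue.Queue of (msg_id, payload) tuples.
    """
    # Compute the deadline once, up front; None means wait indefinitely.
    deadline = None if timeout is None else clock() + timeout
    while True:
        if deadline is None:
            remaining = None
        else:
            remaining = deadline - clock()
            if remaining <= 0:
                raise ReplyTimeout("no reply for %r within %ss" % (msg_id, timeout))
        try:
            # Only wait for what is left of the original budget.
            incoming_msg_id, payload = replies.get(timeout=remaining)
        except queue.Empty:
            raise ReplyTimeout("no reply for %r within %ss" % (msg_id, timeout))
        if incoming_msg_id == msg_id:
            return payload
        # Reply for another caller: real code would hand it over to that
        # caller; this sketch drops it and loops with a smaller budget.


replies = queue.Queue()
replies.put(("other-caller", "ignored"))
replies.put(("my-call", "result"))
print(wait_for_reply(replies, "my-call", timeout=1.0))
```

Because the remaining time shrinks on every iteration, a steady stream of replies for other callers can no longer extend the wait beyond the requested timeout.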
Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
status: New → In Progress
Changed in oslo.messaging:
assignee: Mehdi Abaakouk (sileht) → Joshua Harlow (harlowja)
Changed in oslo.messaging:
milestone: none → 1.5.0
status: Fix Committed → Fix Released
https://review.openstack.org/#/c/137456/