n-cpu raising MessageUndeliverable when replying to RPC call

Bug #1905965 reported by Herve Beraud
Affects                          Status         Importance  Assigned to    Milestone
Ubuntu Cloud Archive             Fix Released   Undecided   Unassigned
  Ussuri                         Fix Committed  High        Unassigned
  Victoria                       Fix Committed  High        Unassigned
oslo.messaging                   Confirmed      High        Herve Beraud
python-oslo.messaging (Ubuntu)   Fix Released   Undecided   Unassigned
  Focal                          Fix Committed  High        Unassigned

Bug Description

Summary
=======

Recently, on Train/OSP 16.1, we noticed `MessageUndeliverable` exceptions when replying to RPC calls in Nova [1]. I think that those exceptions raised within Nova aren't legitimate and point to a new bug in oslo.messaging.

Indeed, in oslo.messaging, in a normal situation, `MessageUndeliverable` is raised when a message is sent to a reply queue that doesn't exist for some reason. This behaviour was introduced in oslo.messaging 10.2.0 by adding the mandatory flag to direct sending [2][3]. It is expected behaviour, meant to detect early that the reply queue does not exist.

However, I think that those raised within nova are due to a limitation of RabbitMQ's RPC direct reply-to feature.

Also, I think that this feature (raising `MessageUndeliverable` when the mandatory flag is set) introduced unexpected side effects in Nova, where the RPC client times out because the server's reply messages end up unrouted and therefore undelivered to the client.

Observed Bug
============

Here, in Nova, on the server side (nova-compute), we can observe the following traceback:

```
2020-10-30 16:32:54.059 8 ERROR oslo_messaging.rpc.server [req-99a5cda6-7c8e-4cba-88f4-37b3447d4dbd c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] MessageUndeliverable error, source exception: Basic.return: (312) NO_ROUTE, routing_key: reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : : oslo_messaging.exceptions.MessageUndeliverable
```

Still in Nova, but on the client side (nova-api), we can observe the following traceback:

```
c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] Unexpected exception in API method: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID cb84a7fc264748aa9f19ddab48311975
```

The client never received the response and then reached a timeout because the message wasn't routed "MessageUndeliverable error, source exception: Basic.return: (312) NO_ROUTE, routing_key: reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : : ".

That leads to a Nova issue where volume attachment fails: Cinder shows the volume as available but Nova shows it as attached. nova-api calls `reserve_block_device_name` on the compute through RPC and then times out after 10 minutes because it never receives the reply. In fact the server does reply, but the message is not routed to its destination (cf. the server traceback above).

This leaves a block_device_mapping record in the database, so the volume remains listed in the os-volume_attachments API and in the `nova volume-attachments` output. It also causes subsequent attempts to attach the volume to fail, as we already have an active bdm record for it.

Details about the `direct_mandatory_flag` feature of oslo.messaging
===================================================================

In a normal situation this mandatory flag tells the server how to react if a message cannot be routed to a queue. Specifically, if mandatory is set and after running the bindings the message was placed on zero queues then the message is returned to the sender (with a basic.return). If mandatory had not been set under the same circumstances the server would silently drop the message.

With the mandatory flag disabled, a missing reply queue results in a `MessagingTimeout`. If `direct_mandatory_flag` [4] is set to True, then `MessageUndeliverable` is raised immediately when a reply queue doesn't exist; if the option doesn't exist (oslo.messaging < 10.2.0) or is set to False, then you have to wait for the default timeout.
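
To make this concrete, here is a minimal sketch of the mandatory-flag behaviour against a local RabbitMQ. It uses pika (not the kombu/py-amqp stack that oslo.messaging actually uses) and a made-up routing key, purely to show how an unroutable mandatory publish surfaces as an error instead of being silently dropped:

```
# Sketch only: pika against a local RabbitMQ; oslo.messaging uses kombu/py-amqp instead.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.confirm_delivery()  # needed so unroutable returns surface as exceptions

try:
    # Publish to the default exchange with a routing key for which no queue exists.
    channel.basic_publish(
        exchange="",
        routing_key="reply_does_not_exist",  # hypothetical reply queue name
        body=b'{"result": "done"}',
        mandatory=True,  # ask the broker for a basic.return instead of a silent drop
    )
except pika.exceptions.UnroutableError:
    # This is the situation oslo.messaging reports as MessageUndeliverable:
    # the broker returned the message with (312) NO_ROUTE.
    print("reply could not be routed (NO_ROUTE)")
finally:
    connection.close()
```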

The Root Cause
==============

I think that with the RPC server's direct reply, the `direct_mandatory_flag` feature doesn't work as expected.

First, let's briefly describe how an RPC server works with RabbitMQ.

The RPC server(s) consume requests from a request queue and then send replies to each client using the queue named by the client in the reply-to header.

A client has two options:
- declare a single-use queue for each request-response pair, or
- create a long-lived queue for its replies.

The direct reply-to feature allows RPC clients to receive replies directly from their RPC server, without going through a reply queue. "Directly" here still means going through the same connection and a RabbitMQ node; there is no direct network connection between RPC client and RPC server processes.

The RPC server will then see a reply-to property with a generated name. It should publish to the default exchange ("") with the routing key set to this value (i.e. just as if it were sending to a reply queue as usual). The message will then be sent straight to the client consumer.
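
As an illustration of the client side of this pattern, here is a rough sketch using pika with made-up names; it only demonstrates the RabbitMQ feature described above, not oslo.messaging code:

```
# Sketch only: client side of RabbitMQ's direct reply-to pattern, using pika.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

def on_reply(ch, method, properties, body):
    print("got reply:", body)
    ch.stop_consuming()

# Consume from the pseudo-queue in no-ack mode *before* publishing the request.
channel.basic_consume("amq.rabbitmq.reply-to", on_reply, auto_ack=True)

# The request carries reply_to=amq.rabbitmq.reply-to; the server must publish its
# reply to the default exchange ("") using the generated name it sees in reply-to.
channel.basic_publish(
    exchange="",
    routing_key="rpc_requests",  # hypothetical RPC request queue
    properties=pika.BasicProperties(reply_to="amq.rabbitmq.reply-to"),
    body=b'{"method": "reserve_block_device_name"}',
)
channel.start_consuming()
```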

However, this feature has some caveats and limitations [5]...

Especially the fact that the name `amq.rabbitmq.reply-to` is used in `basic.consume` and the `reply-to` property as if it were a queue; however it is not. It cannot be deleted, and does not appear in the management plugin or `rabbitmqctl list_queues`.

If the RPC server publishes with the mandatory flag set then `amq.rabbitmq.reply-to.*` is treated as not a queue [5]; i.e. if the server only publishes to this name then the message will be considered "not routed" [5][6]; a `basic.return` will be sent if the mandatory flag was set.

And now we are back to the previously observed behaviour: nova-compute's RPC server, replying by sending a direct `reply-to` message to the client, saw these messages go unrouted (cf. the server traceback above). After a while the client reached the timeout, as the message was never delivered to it; as a side effect this left a block_device_mapping record in the database, leaving it listed in the os-volume_attachments API and in the `nova volume-attachments` output. This also causes subsequent attempts to attach the volume to fail, as we already have an active bdm record for it.

Solutions
=========

Workaround
~~~~~~~~~~

I think this bug could easily be worked around by disabling the `direct_mandatory_flag` option; that would be a simple way to unblock those who face a similar issue.

If customers/operators see this repeatedly in their environment, they could try disabling the `[oslo_messaging_rabbit]/direct_mandatory_flag` option in the computes' `nova.conf`.
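
Concretely, that workaround would look like the sketch below. Note that, depending on the oslo.messaging release, the option may be parsed as an integer, in which case `0` must be used instead of `False` (see the senlin report further down in the comments):

```
# computes' nova.conf
[oslo_messaging_rabbit]
direct_mandatory_flag = False
```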

Short Term Solution
~~~~~~~~~~~~~~~~~~~

The `direct_mandatory_flag` should be disabled by default to ensure that we don't see a wave of similar issues from services other than Nova. Potentially, every user of oslo.messaging's RPC server can face a similar issue, as this is the default behaviour; AFAIK the server will always send a direct response using `reply-to`.

Middle Term Solution
~~~~~~~~~~~~~~~~~~~~

I think that the `direct_mandatory_flag` usage should be removed/disabled in `direct_send` [7] until we can determine whether we use the `reply-to` feature in parallel. I think we are in a grey area here: it looks like oslo.messaging's direct messaging implicitly introduces the usage of `reply-to` [8], however this isn't 100% clear to me, so feedback from more experienced people would be appreciated.

Anyway, I think that explicit is better than implicit, and if any doubts remain then they must be cleared up.

Long Term Solution
~~~~~~~~~~~~~~~~~~

I think that we shouldn't rely on anything other than real queues. Real queues are a bit more costly in terms of performance, and each option has some drawbacks:

- a single-use queue for each request-response pair can be expensive to create and then delete;
- a long-lived queue for replies can be fiddly to manage, especially if the client itself is not long-lived.

However, I think we should avoid using "non-real queues" and give priority to reliability/stability over performance.

Also, real queues can be monitored more easily than direct reply-to. That would allow operators to be a bit more proactive on similar issues, by monitoring reply queues as soon as strange behaviour appears between RPC client and server.

RabbitMQ offers many HA features [9] that we could take advantage of, especially quorum queues [10]. That may be a track to follow, allowing us to use real queues for RPC responses while continuing to use the `direct_mandatory_flag` to monitor that everything is OK.
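
Purely as an illustration of that idea (kombu, hypothetical names, not a proposal for the actual driver code), a "real" reply queue could be declared as a quorum queue via the `x-queue-type` argument:

```
# Sketch only: declaring a durable quorum reply queue with kombu (hypothetical names).
from kombu import Connection, Exchange, Queue

reply_exchange = Exchange("reply_abc123", type="direct", durable=True)
reply_queue = Queue(
    "reply_abc123",
    exchange=reply_exchange,
    routing_key="reply_abc123",
    durable=True,  # quorum queues must be durable
    queue_arguments={"x-queue-type": "quorum"},  # replicated across RabbitMQ nodes
)

with Connection("amqp://guest:guest@localhost//") as conn:
    reply_queue(conn).declare()  # declares the exchange, the queue and the binding
```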

This could lead to significant changes in our design, so I think it should be discussed through a dedicated blueprint, to allow us to come up with the best possible solution.

Conclusion
==========

I think we will soon see similar issues appear outside of Nova due to the behaviour described above. However, a simple workaround is available for now.

I think that if services start to observe similar symptoms, then they should disable the `direct_mandatory_flag` feature in their config ASAP.

I don't think it's necessary to blacklist the versions of oslo.messaging that contain this feature, because it can be disabled, and blacklisting would deprive us of other needed bugfixes released since.

Fortunately, some avenues are available to improve things.

Hopefully this will help us to address this corner case.

Thanks for reading!

Hervé Beraud (hberaud)

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1898578
[2] https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6
[3] https://docs.openstack.org/oslo.messaging/latest/admin/rabbit.html#exchange
[4] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L172,L177
[5] https://www.rabbitmq.com/direct-reply-to.html#limitations
[6] https://www.rabbitmq.com/amqp-0-9-1-reference.html#constants
[7] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1313,L1324
[8] https://docs.openstack.org/oslo.messaging/latest/admin/AMQP1.0.html#direct-messaging
[9] https://www.rabbitmq.com/ha.html
[10] https://www.rabbitmq.com/quorum-queues.html

Tags: sts
Changed in oslo.messaging:
assignee: nobody → Herve Beraud (herveberaud)
Revision history for this message
John Eckersberg (jeckersb) wrote :

I think there is some conflation of terminology here.

RabbitMQ *does* have the "direct reply-to" feature described here:

https://www.rabbitmq.com/direct-reply-to.html

However, in the oslo.messaging rabbitmq driver, the references to "direct" are not referring to the same thing. Instead, they simply differentiate the "direct" exchange type from the others, "fanout" and "topic" ("headers" is unused):

https://www.rabbitmq.com/tutorials/amqp-concepts.html#exchanges

As far as I can tell, within the intended RPC architecture, the mandatory flag is working as intended here. There is no special reply-to shortcut being used. The RPC sender (should) declare the reply_XXX queue/binding/exchange triplet that the RPC server uses to return the reply. However if the server can't send the reply back for some reason (error somewhere in that exchange->binding->queue flow) then we *do* want to raise the error, otherwise we would blackhole the msg and the sender would never know it was lost.
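
For readers less familiar with that flow, here is a rough kombu sketch of it (illustrative names only, not the actual oslo.messaging driver code): the RPC sender declares the reply_XXX exchange/binding/queue triplet, and the replying side publishes into it with the mandatory flag so that a broken flow is reported instead of black-holing the message:

```
# Sketch only (kombu, illustrative names) of the exchange->binding->queue reply flow.
from kombu import Connection, Exchange, Queue

reply_name = "reply_abc123"  # hypothetical reply_XXX name
reply_exchange = Exchange(reply_name, type="direct", durable=False)
reply_queue = Queue(reply_name, exchange=reply_exchange, routing_key=reply_name)

with Connection("amqp://guest:guest@localhost//") as conn:
    # RPC sender: declare exchange, queue and binding before waiting for the reply.
    reply_queue(conn).declare()

    # Replying side: publish with mandatory=True; if the triplet is gone (e.g. lost
    # across a failover), the broker returns the message (NO_ROUTE) instead of
    # silently dropping it.
    producer = conn.Producer()
    producer.publish(
        {"result": "done"},
        exchange=reply_exchange,
        routing_key=reply_name,
        mandatory=True,
    )
```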

I've seen this same/similar behavior in the past but I've never been able to reproduce it successfully. See for example from 3+ years ago:

https://bugzilla.redhat.com/show_bug.cgi?id=1484543#c5

Something happens to break the normal exchange->binding->queue flow. I'm not sure if this is something internal to RabbitMQ where object(s) get lost/corrupted during failover? Or perhaps this is a bug in oslo.messaging/kombu/pyamqp wherein the necessary objects are not properly redeclared on reconnection.

Revision history for this message
Ken Giusti (kgiusti) wrote :

Thanks for the input John. You are correct - the current implementation of oslo.messaging does not use the special Direct Reply To feature provided by rabbitmq [0]. Whether or not it should is an entire discussion in itself :D

Currently oslo.messaging uses a "somewhat long lived" reply queue that is associated with the direct exchange. It is "somewhat long lived" in that its lifecycle is associated with the client's transport, and therefore will be used for all requests issued by the client (each individual request is identified by a correlation id - essentially multiplexing replies on the same queue).

Back to the issue at hand: if I understand you correctly, oslo.messaging's current use of the mandatory flag - to force an error exception on the server when the reply cannot be delivered - is correct. Logically it seems correct to me as well.

However from the user's point of view (in this case nova) it's a breaking API change. Mea culpa - turning on the mandatory feature by default should not have been done without making a whole lot more public noise than I did.

Sounds like the best path forward would be to turn off the mandatory behavior by default (it's configurable) until folks have a chance to update their code to handle the new behavior.

[0] https://www.rabbitmq.com/direct-reply-to.html

Revision history for this message
Herve Beraud (herveberaud) wrote :

Thanks John for your feedback.

You're right I was misled by the terminology.

I agree with Ken, the mandatory flag is a good feature but the communication around that wasn't enough for our users.

I think that we should provide some examples of how to handle this kind of failure on the user side (services etc.); it could be addressed with some doc patches.

Do you have any ideas for scenarios that we should cover in our examples?

Concerning nova, I don't think that we broke anything (by adding the mandatory flag). If I understand John's analysis correctly, the issue could come from a connection problem or from an internal server problem, so even without the flag nova would face more or less the same issue.

So on one side I think we should improve our doc and our communication about this feature, and on the other side I think we should provide some workaround to apply if we face a similar scenario in the future. Apparently this kind of issue is hard to reproduce, so it could be worth providing a reusable workaround publicly referenced somewhere (in our doc, downstream?).

@John, do you have any ideas for workarounds to recover from a similar situation?

Revision history for this message
Satish Patel (satish-txt) wrote :

I am seeing the same behavior with the senlin project. Here are the details: https://bugs.launchpad.net/senlin/+bug/1909016

Revision history for this message
Satish Patel (satish-txt) wrote :

When I try to add the following values in /etc/senlin/senlin.conf

[DEFAULT]
transport_url = rabbit://senlin:94d7aecb853145779db8f1dcb@10.65.6.176:5671//senlin?ssl=1

[oslo_messaging_rabbit]
ssl = True
direct_mandatory_flag = False

I am getting this error in the logs:

Dec 22 14:43:26 os-lab-infra-1-senlin-container-16f24bbe senlin-conductor[8591]: 2020-12-22 14:43:25.999 8591 ERROR oslo_service.service [-] Error starting thread.: oslo_config.cfg.ConfigFileValueError: Value for option direct_mandatory_flag from LocationInfo(location=<Locations.user: (4, True)>, detail='/etc/senlin/senlin.conf') is not valid: invalid literal for int() with base 10: 'False'

Ben Nemec (bnemec)
Changed in oslo.messaging:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
John Eckersberg (jeckersb) wrote :

re: Ken's comment:

"""
However from the user's point of view (in this case nova) it's a breaking API change. Mea culpa - turning on the mandatory feature by default should not have been done without making a whole lot more public noise than I did.
"""

I agree with this. The main issue is that oslo.messaging will now raise a MessageUndeliverable exception with the mandatory flag set. This exception was added here:

https://github.com/openstack/oslo.messaging/commit/c50076b4efb79cef46d618d6d80eecbcebb72898

And then later started getting raised when the mandatory flag was added to direct_send here:

https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6

This is the detail that is API-breaking; nova and the like are not prepared to catch this exception and handle it.

However, we already do almost the exact same thing presently, albeit in a slightly different manner.

In the base amqpdriver here:

https://github.com/openstack/oslo.messaging/blob/e44c988/oslo_messaging/_drivers/amqpdriver.py#L151-L169

We catch AMQPDestinationNotFound and loop for missing_destination_retry_timeout seconds. This is accomplished by setting the passive=True flag on the exchange, and then probing to see if the exchange exists by issuing a exchange.declare. If the exchange is already present, exchange.declare-ok is returned. If the exchange is not present, a channel error is raised and the exchange is *not* declared. This is so the replying end can probe to see when the original requester re-establishes itself (it is on the sender to declare the exchange).

I think for now we could get away with simply catching the MessageUndeliverable exception in the same place, and we could loop for the same timeout while it remains undeliverable. The only problem I see is that the blueprint spec for the transport options is there explicitly so that the rpc client itself can declare at_least_once=True and expect to get back MessageUndeliverable if that fails. So we could keep track of the transport_options originally provided by the client. If the client explicitly set at_least_once=True, then when MessageUndeliverable is raised, we should re-raise it back to the client, with the assumption that it knows how to handle it (possibly after retrying some number of times?). However, if the client did not explicitly ask for it (the current case where nova and such do not use this API), then we can still use the functionality within oslo.messaging, but we need to catch it and not re-raise it, the same way AMQPDestinationNotFound works currently.
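
A rough sketch of that proposal, with hypothetical helper names (the real reply path in amqpdriver.py is structured differently):

```
# Sketch only: retry undeliverable replies, re-raising only when the caller opted in.
import logging
import time

from oslo_messaging import exceptions

LOG = logging.getLogger(__name__)


def send_reply_with_retry(send_reply, at_least_once, retry_timeout):
    """Hypothetical helper wrapping the reply publish (mandatory flag set)."""
    deadline = time.monotonic() + retry_timeout
    while True:
        try:
            send_reply()
            return
        except exceptions.MessageUndeliverable:
            if at_least_once:
                # The RPC client explicitly asked (via transport_options) to be
                # told about undeliverable replies: propagate the exception.
                raise
            if time.monotonic() >= deadline:
                # Mirror the AMQPDestinationNotFound handling: log and abandon
                # the reply after missing_destination_retry_timeout seconds.
                LOG.warning("Reply could not be delivered, abandoning it")
                return
            time.sleep(0.25)  # give the requester time to redeclare its queue
```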

Revision history for this message
Ken Giusti (kgiusti) wrote :

+1 John that's a great idea.

Revision history for this message
Herve Beraud (herveberaud) wrote :

Hello,

Today I submitted changes related to John's proposal (cf. PS3 https://review.opendev.org/c/openstack/oslo.messaging/+/764776/3).

These changes are still "WIP" to allow us to discuss the details of the implementation.

The first thing we could discuss is whether we can drop `direct_mandatory_flag` in favor of `at_least_once`.

I don't think it really makes sense to keep both flags in parallel.

Either we want to re-raise the `MessageUndeliverable` exception or to loop until a timeout, but I don't think we want to stop using the mandatory flag, do we? For this reason I think we can simply define the `at_least_once` option as done in PS3 and abandon the `direct_mandatory_flag` option.

Also, notice that I tried to simulate the `MessageUndeliverable` scenario using oslo.messaging's simulator; unfortunately I couldn't reproduce the exception.

The process I followed is below; maybe you will have an idea on how to simulate this.

1. Run a rabbitmq server and allow to access to the dashboard to monitor things:

```
$ sudo podman run -d -e RABBITMQ_NODENAME=rabbitmq \
     -p 5672:5672 \
     -p 15672:15672 \
     --name rabbitmq rabbitmq:3-management
```

2. Prepare some configuration to use with our simulator:

```
# tuto.conf
[DEFAULT]
transport_url = rabbit://localhost/
at_least_once = False
```

3. Run the RPC server

```
$ tox -e venv -- python tools/simulator.py --config-file ./tuto.conf -d rpc-server -w 40
```

Notice that I asked the server to wait before answering, to allow me to manually delete the reply queue on the RabbitMQ side.

4. Run the RPC client

```
$ tox -e venv -- python tools/simulator.py --config-file ./tuto.conf -d rpc-client
```

5. Delete the corresponding reply queue

```
$ sudo podman exec -it rabbitmq rabbitmqctl delete_queue reply_<id>
```

Unfortunately I couldn't trigger the expected situation by following this process...

I suppose that some other mechanism detects that something went wrong in this scenario, which could explain why I didn't reach the expected situation.

Please let me know if these changes fit our expectations and make sense to you, or if they need changes.

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

Hi Hervé,

We can consistently reproduce the MessageUndeliverable error in nova-compute when trying to attach 7 cinder volumes right after a VM is created, all via a heat template.

Maybe you can give this use case a try.

The way to get rid of the error is: (1) Set direct_mandatory_flag to 0 for nova-compute, and use a very big rpc_response_timeout (240) in nova_api.

The same heat template runs OK in Rocky; we are also looking at why it now fails in Train.

Jing

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

This is to provide an update on our issue:

(1) Using a very big rpc_response_timeout for nova_api cannot reliably get rid of the issue.

(2) Using a single mod_wsgi thread for nova_api resolves the issue reliably; in addition, the other "difficult" cinder volume test cases that run in Rocky now work in Train.

The following ticket filtered out the RabbitMQ heartbeat errors in nova_api.log, but luckily the errors are still shown in rabbitmq.log, which helped us nail down the issue.

https://bugs.launchpad.net/nova/+bug/1825584

Jing

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.7.1

This issue was fixed in the openstack/oslo.messaging 12.7.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.1.6

This issue was fixed in the openstack/oslo.messaging 12.1.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.5.2

This issue was fixed in the openstack/oslo.messaging 12.5.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 10.2.4

This issue was fixed in the openstack/oslo.messaging 10.2.4 release.

Changed in python-oslo.messaging (Ubuntu Focal):
status: New → Triaged
importance: Undecided → High
Changed in python-oslo.messaging (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Released
tags: added: sts
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is being fixed via LP:1940858

Changed in python-oslo.messaging (Ubuntu Focal):
status: Triaged → Fix Committed