n-cpu raising MessageUndeliverable when replying to RPC call
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Ussuri | Fix Committed | High | Unassigned | |
| Victoria | Fix Committed | High | Unassigned | |
| oslo.messaging | Confirmed | High | Herve Beraud | |
| python-oslo.messaging (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Focal | Fix Committed | High | Unassigned | |
Bug Description
Summary
=======
Recently, on train/OSP16.1 we noticed `MessageUndeliverable` errors when nova-compute replies to RPC calls.
Indeed, on oslo.messaging, in a normal situation, a `MessageUndeliverable` error is raised when a published message cannot be routed to any queue.
However, I think that those raised within nova are due to a limitation of RabbitMQ's RPC direct reply-to feature.
Also, I think that this behaviour (raising `MessageUndeliverable` when a reply cannot be routed) is recent and was introduced by the `direct_mandatory_flag` feature of oslo.messaging.
Observed Bug
============
Here in Nova on the server side (nova-compute) we can observe the following traceback:
```
2020-10-30 16:32:54.059 8 ERROR oslo_messaging.
```
Still on Nova and on the client (nova-api) we can observe the following traceback:
```
c767e1727b13484
```
The client never received the response and eventually reached a timeout, because the reply message wasn't routed: the `MessageUndeliverable` error above was raised on the server side instead.
That led us to a Nova issue where volume attachment fails: Cinder shows the volume as available while nova shows it as attached. nova-api calls `reserve_block_device_name` on nova-compute through RPC and never receives the reply, ending in the timeout shown above.
This leaves a stale `block_device_mapping` record in the nova database for a volume that was never actually attached.
Details about the `direct_mandatory_flag` feature
=================================================
In a normal situation this mandatory flag tells the server how to react if a message cannot be routed to a queue. Specifically, if mandatory is set and after running the bindings the message was placed on zero queues then the message is returned to the sender (with a basic.return). If mandatory had not been set under the same circumstances the server would silently drop the message.
By disabling the mandatory flag, if the reply queue doesn't exist you will only fall into a MessagingTimeout. If `direct_mandatory_flag` is enabled, the `MessageUndeliverable` error is raised as soon as the reply cannot be routed, without waiting for the timeout to be reached.
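The mandatory-flag semantics above can be sketched with a toy pure-Python broker. This is an illustrative model, not pika/kombu or RabbitMQ code; all names (`Broker`, `Unroutable`, etc.) are invented for the sketch:

```python
# Toy model of RabbitMQ's `mandatory` publish flag: if the routing key
# matches no queue, a mandatory message is bounced back to the sender
# (basic.return); a non-mandatory one is silently dropped.

class Unroutable(Exception):
    """Stands in for the basic.return the client library surfaces
    (MessageUndeliverable in oslo.messaging terms)."""

class Broker:
    def __init__(self):
        self.queues = {}  # routing key -> list of delivered messages

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def publish(self, routing_key, body, mandatory=False):
        if routing_key not in self.queues:
            if mandatory:
                raise Unroutable(body)  # sender is told immediately
            return                      # silently dropped
        self.queues[routing_key].append(body)

broker = Broker()
broker.declare_queue("reply_abc")
broker.publish("reply_abc", "ok", mandatory=True)      # routed normally
broker.publish("reply_gone", "lost", mandatory=False)  # dropped, no error
try:
    broker.publish("reply_gone", "lost", mandatory=True)
except Unroutable:
    print("returned to sender")
```

Without the flag the sender only finds out via a timeout; with it, the failure is reported at publish time.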
The Root Cause
==============
I think that with the RPC server's direct reply-to, the `direct_mandatory_flag` feature has a blind spot: the mandatory flag cannot work against a pseudo-queue, as described below.
First, let's start by describing a bit how an RPC server works with RabbitMQ.
The RPC server(s) consume requests from a request queue, and then send replies to each client using the queue named by the client in the `reply-to` header.
A client has two options:
- declare a single-use queue for each request-response pair;
- create a long-lived queue for its replies.
The direct reply-to feature allows RPC clients to receive replies directly from their RPC server, without going through a reply queue. "Directly" here still means going through the same connection and a RabbitMQ node; there is no direct network connection between RPC client and RPC server processes.
The RPC server will then see a reply-to property with a generated name. It should publish to the default exchange ("") with the routing key set to this value (i.e. just as if it were sending to a reply queue as usual). The message will then be sent straight to the client consumer.
However this feature has some caveats and limitations [5].
Especially the fact that the name `amq.rabbitmq.reply-to` is a pseudo-queue: it is used in `basic.consume` and in the `reply-to` property as if it were a queue, but it is not a real queue and cannot be declared.
If the RPC server publishes with the mandatory flag set, then `amq.rabbitmq.reply-to.*` is treated as not existing: the reply is bounced back to the server with a `basic.return`, exactly as if the queue didn't exist.
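This caveat can be sketched with a small pure-Python model (illustrative names only, not the RabbitMQ API):

```python
# Toy model of the direct reply-to caveat: `amq.rabbitmq.reply-to.*`
# is a pseudo-queue. A plain publish to it is delivered straight to the
# waiting client, but a *mandatory* publish treats it as nonexistent
# and triggers a basic.return.

PSEUDO_PREFIX = "amq.rabbitmq.reply-to."

class Unroutable(Exception):
    pass

def publish_reply(real_queues, reply_to, body, mandatory):
    if reply_to.startswith(PSEUDO_PREFIX):
        if mandatory:
            raise Unroutable(body)  # pseudo-queue "doesn't exist"
        return "delivered-direct"   # plain publish goes through
    if reply_to not in real_queues:
        if mandatory:
            raise Unroutable(body)
        return "dropped"
    return "delivered-queue"

# With direct reply-to plus the mandatory flag, the RPC reply bounces
# and the client eventually hits MessagingTimeout:
try:
    publish_reply(set(), PSEUDO_PREFIX + "g1h2", "result", mandatory=True)
except Unroutable:
    print("reply bounced")

# Without the mandatory flag the very same reply is delivered:
print(publish_reply(set(), PSEUDO_PREFIX + "g1h2", "result", mandatory=False))
```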
And we are now back to our previously observed behaviour: nova-compute's RPC server, replying directly to the client's `reply-to` address, saw these messages as unrouted (c.f. the server traceback above). After a while the client reached the timeout since the reply was never delivered to it, and this had the side effect of leaving a stale `block_device_mapping` record in the database.
Solutions
=========
Workaround
~~~~~~~~~~
I think this bug could be easily worked around by disabling the `direct_mandatory_flag` option.
If customers/operators see this repeatedly in their environment, they could try to disable the `[oslo_messaging_rabbit] direct_mandatory_flag` option; the tradeoff is that unroutable replies are then silently dropped and the client only observes a MessagingTimeout.
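Assuming the boolean option is named `direct_mandatory_flag` under the `[oslo_messaging_rabbit]` group (check the exact name against your oslo.messaging release), the workaround would look like this in e.g. nova.conf:

```ini
[oslo_messaging_rabbit]
# Stop publishing RPC replies with the mandatory flag; undeliverable
# replies are then dropped silently instead of raising
# MessageUndeliverable on the server side.
direct_mandatory_flag = False
```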
Short Term Solution
~~~~~~~~~~~~~~~~~~~
The `direct_mandatory_flag` option could be disabled by default in oslo.messaging until its interaction with direct reply-to is fixed, restoring the previous behaviour for RPC replies.
Middle Term Solution
~~~~~~~~~~~~~~~~~~~~
I think that the `direct_mandatory_flag` handling should not be applied when the reply goes through direct reply-to; in other words, the mandatory flag should be skipped (or its `basic.return` ignored) for `amq.rabbitmq.reply-to.*` destinations.
Anyway, I think that explicit is better than implicit, and if some doubt remains then it must be cleared up.
Long Term Solution
~~~~~~~~~~~~~~~~~~
I think that we shouldn't rely on anything other than real queues. Real queues are a bit more costly to use in terms of performance, and each approach has its drawbacks:
- a single-use queue for each request-response pair can be expensive to create and then delete;
- a long-lived queue for replies can be fiddly to manage, especially if the client itself is not long-lived.
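The second pattern (one long-lived reply queue shared across requests, with responses matched by correlation id) can be sketched in pure Python. `RpcClient` and `echo_server` are illustrative names for the sketch, not oslo.messaging code:

```python
import itertools
import queue

# One long-lived reply queue per client; each response is matched to
# its request via a correlation id instead of one queue per call.

class RpcClient:
    _ids = itertools.count(1)

    def __init__(self, server):
        self.server = server
        self.reply_q = queue.Queue()  # long-lived reply queue
        self.pending = {}             # correlation id -> result

    def call(self, payload):
        corr_id = next(self._ids)
        # "Publish" the request, naming our reply queue.
        self.server(payload, corr_id, self.reply_q)
        # Drain the shared reply queue until our reply shows up.
        while corr_id not in self.pending:
            got_id, result = self.reply_q.get(timeout=1)
            self.pending[got_id] = result
        return self.pending.pop(corr_id)

def echo_server(payload, corr_id, reply_q):
    # The "server" replies to the queue named by the client,
    # echoing the correlation id back.
    reply_q.put((corr_id, payload.upper()))

client = RpcClient(echo_server)
print(client.call("attach"))  # prints ATTACH
```

Because the queue is real and declared by the client, a mandatory publish of the reply behaves as documented, and the queue itself can be monitored.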
However I think we should avoid using "non-real" queues; we should give priority to reliability/stability over performance.
Also, real queues can be monitored more easily than direct reply-to. That would allow operators to be a bit more proactive on similar issues, by watching the reply queues as soon as strange behaviour appears between RPC client and server.
RabbitMQ offers us many HA features [9] that we could benefit from, especially Quorum queues [10]; that may be a track to follow to let us use real queues for RPC responses, while continuing to use the `direct_mandatory_flag` to verify that replies are actually routed.
This could lead to important changes in our design, so I think it should be discussed through a dedicated blueprint so that we can come up with the best possible solution.
Conclusion
==========
I think we will soon see similar issues appear outside of Nova too, due to the described limitation. However, a simple workaround is available for now.
I think that if services start to observe similar symptoms, they should start by disabling the `direct_mandatory_flag` option.
I don't think it's necessary to blacklist the versions of oslo.messaging that contain this feature: it can be disabled, and blacklisting would deprive us of other needed bugfixes released since.
Fortunately, some tracks are available to improve things.
Hopefully this will help us deal with this corner case.
Thanks for reading!
Hervé Beraud (hberaud)
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
[7] https:/
[8] https:/
[9] https:/
[10] https:/
Changed in oslo.messaging:
assignee: nobody → Herve Beraud (herveberaud)
Changed in oslo.messaging:
status: New → Confirmed
importance: Undecided → High
Changed in python-oslo.messaging (Ubuntu Focal):
status: New → Triaged
importance: Undecided → High
Changed in python-oslo.messaging (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Released
tags: added: sts
I think there is some conflation of terminology here.
RabbitMQ *does* have the "direct reply-to" feature described here:
https://www.rabbitmq.com/direct-reply-to.html
However, in the oslo.messaging rabbitmq driver, the references to "direct" are not referring to the same thing. Instead, they simply differentiate the "direct" exchange type from the others, "fanout" and "topic" ("headers" is unused):
https://www.rabbitmq.com/tutorials/amqp-concepts.html#exchanges
As far as I can tell, within the intended RPC architecture, the mandatory flag is working as intended here. There is no special reply-to shortcut being used. The RPC sender should declare the reply_XXX queue/binding/exchange triplet that the RPC server uses to return the reply. However, if the server can't send the reply back for some reason (an error somewhere in that exchange->binding->queue flow) then we *do* want to raise the error; otherwise we would blackhole the message and the sender would never know it was lost.
I've seen this same/similar behavior in the past but I've never been able to reproduce it successfully. See for example from 3+ years ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1484543#c5
Something happens to break the normal exchange->binding->queue flow. I'm not sure if this is something internal to RabbitMQ where object(s) get lost/corrupted during failover, or perhaps a bug in oslo.messaging/kombu/py-amqp wherein the necessary objects are not properly redeclared on reconnection.