[SRU] MessageTimeout and DuplicateMessage errors after update
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Invalid | Undecided | Unassigned | |
| Queens | Fix Released | Critical | Unassigned | |
| Rocky | Fix Released | Critical | Unassigned | |
| Stein | Fix Released | Critical | Unassigned | |
| Train | New | Undecided | Unassigned | |
| oslo.messaging | Invalid | Undecided | Unassigned | |
| python-oslo.messaging (Ubuntu) | Invalid | Undecided | Unassigned | |
| Bionic | Fix Released | Critical | Unassigned | |
Bug Description
[Impact]
A recent update to oslo.messaging, made to resolve bug #1789177, causes failures.
(The comments below are copied from the original bug):
After a partial upgrade (only one side, producers or consumers), there are a lot of MessageTimeout and DuplicateMessage errors in the logs. Downgrading back to 5.35.0-
Right after restarting n-ovs-agent, I can see a lot of errors in the rabbitmq log [1],
which is the same error as with the rabbitmq failover issue (the original issue of this LP).
Then, after I upgraded oslo.messaging on the neutron-api unit and restarted neutron-server, the errors below were gone and I was able to create instances again.
After upgrading oslo.messaging on n-ovs only, the exchanges the two sides use to communicate didn't match, because which exchange gets used depends on the publisher-consumer relation (see the sketch after the log excerpt below).
So I think there are two options:
1. revert this patch for Queens (the original failover problem will remain)
2. upgrade both sides within a maintenance window
Thanks a lot
[1]
#######
=ERROR REPORT==== 3-Feb-2021:
Channel error on connection <0.2379.1> (10.0.0.32:60430 -> 10.0.0.34:5672, vhost: 'openstack', user: 'neutron'), channel 1:
{amqp_error,
"no exchange 'reply_
10.0.0.32 is the neutron-api unit
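
To make the exchange mismatch concrete, here is a minimal sketch using kombu directly. It is not oslo.messaging's actual driver code; the connection URL, credentials, and exchange naming are invented for illustration. One side waits for replies on a queue bound to one exchange while the other side publishes the reply to an exchange that was never declared, which RabbitMQ reports as the "no exchange 'reply_...'" channel error seen in the log above.

```python
# Minimal illustration of the reply-exchange mismatch using kombu directly.
# NOT oslo.messaging's implementation; URL, credentials and names are invented.
import uuid

from kombu import Connection, Exchange, Producer, Queue

reply_id = 'reply_' + uuid.uuid4().hex

with Connection('amqp://neutron:secret@10.0.0.34:5672/openstack') as conn:
    channel = conn.channel()

    # Waiting side (e.g. the not-yet-upgraded agent): declares its reply queue
    # bound to an exchange named after the reply id.
    reply_exchange = Exchange(reply_id, type='direct', durable=False)
    Queue(reply_id, exchange=reply_exchange,
          routing_key=reply_id, channel=channel).declare()

    # Replying side after the partial upgrade: uses a different exchange
    # naming convention. Publishing to an exchange that was never declared
    # makes RabbitMQ close the channel with a 404 NOT_FOUND error, i.e. the
    # "no exchange 'reply_...'" lines in the rabbitmq log above, and the
    # waiting caller eventually gives up with a messaging timeout.
    Producer(channel).publish(
        {'result': 'ok'},
        exchange='reply_' + uuid.uuid4().hex,  # mismatched exchange name
        routing_key=reply_id,
    )
```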
[Test Case]
This SRU needs the following scenarios tested:
1) partial upgrade of n-ovs at 5.35.0-0ubuntu3 [1] and n-api/n-gateway at 5.35.0-0ubuntu1 - instance creation will be successful
2) partial upgrade of n-api/n-gateway at 5.35.0-0ubuntu3 [1] and n-ovs at 5.35.0-0ubuntu1 - instance creation will be successful
3) partial upgrade of n-ovs at 5.35.0-0ubuntu2 [1] and n-api/n-gateway at 5.35.0-0ubuntu3 - instance creation will fail (see regression potential)
4) partial upgrade of n-api/n-gateway at 5.35.0-0ubuntu3 [1] and n-ovs at 5.35.0-0ubuntu2 - instance creation will fail (see regression potential)
5) test all neutron nodes at 5.35.0-0ubuntu3 - instance creation will be successful
[1] and neutron* services restarted afterwards (a version-check helper sketch follows below)
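
As a convenience for the scenarios above, here is a small helper a tester could use to confirm which python-oslo.messaging package revision a node actually carries before and after each partial upgrade. It is not part of the SRU; the package name is taken from this bug's Bionic task, and running it on each n-ovs / n-api / n-gateway node is left to the tester.

```python
# Hedged helper: report the installed dpkg revision of python-oslo.messaging
# (e.g. '5.35.0-0ubuntu3') so each node can be checked against the scenario
# being tested before the neutron* services are restarted per footnote [1].
import subprocess


def installed_revision(package='python-oslo.messaging'):
    """Return the dpkg version string of the given package."""
    result = subprocess.run(
        ['dpkg-query', '--showformat=${Version}', '--show', package],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == '__main__':
    print(installed_revision())
```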
[Regression Potential]
There is regression potential for clouds that have already upgraded to 5.35.0-0ubuntu2. This needs to be tested, but if a cloud has fully upgraded to 5.35.0-0ubuntu2, then the same disruption this SRU is trying to solve may occur once again while some services run 5.35.0-0ubuntu2 and others run 5.35.0-0ubuntu3. Once the cloud is entirely at 5.35.0-0ubuntu3, messages will no longer time out.
summary:
- [SRU]
+ [SRU] Recent update broke message handling

summary:
- [SRU] Recent update broke message handling
+ [SRU] MessageTimeout and DuplicateMessage errors after udpate

Changed in python-oslo.messaging (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Critical

Changed in python-oslo.messaging (Ubuntu):
status: New → Invalid

Changed in cloud-archive:
status: New → Invalid

description: updated

Changed in oslo.messaging:
status: New → Invalid

summary:
- [SRU] MessageTimeout and DuplicateMessage errors after udpate
+ [SRU] MessageTimeout and DuplicateMessage errors after update
I'm marking this as not affecting Queens because the change that caused this regression didn't get out of queens-proposed.