Timed out waiting for a reply via rabbit
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| oslo.messaging | Fix Released | High | Mehdi Abaakouk | |
| oslo.messaging (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Trusty | Fix Released | High | James Page | |
| Utopic | Won't Fix | High | Jorge Niedbalski | |
| Vivid | Fix Released | Undecided | Unassigned | |
Bug Description
Icehouse
oslo-messaging 1.3.0 (stable/icehouse branch from github, c21a2c2)
rabbitmq-server 3.1.3
Nova rpc calls often fail after rabbit restarts. I've tracked it down to oslo/rabbit/kombu timing out when it is forced to reconnect to rabbit. The code below times out waiting for a reply if the topic has been used in a previous run. The reply always arrives the first time a topic is used, or if the topic is None. But the second run with the same topic hangs with this error:
MessagingTimeout: Timed out waiting for a reply to message ID ...
This problem seems too basic not to have been caught earlier in oslo, but the program below really does reproduce the same symptoms we see in nova when run against a live rabbit server.
Here's a log from a test run: https:/
#! /usr/bin/python
import logging
import threading

from oslo.config import cfg
from oslo import messaging

log = logging.getLogger(__name__)


class OsloTest():
    def test(self):
        # The code below times out waiting for a reply if the topic
        # has been used in a previous run. The reply always arrives
        # the first time a topic is used, or if the topic is None.
        # But, the second run with the same topic will hang with this
        # error:
        #
        #   MessagingTimeout: Timed out waiting for a reply to message ID ...
        #
        topic = 'will_hang_topic'  # any fixed topic reused across runs
        #topic = None              # never hangs
        url = "%(proto)s://%(user)s@%(host)s/" % dict(
            proto='rabbit',
            host='1.2.3.4',
            user='rabbit-mq-user',
        )
        transport = messaging.get_transport(cfg.CONF, url)
        driver = transport._driver

        target = messaging.Target(topic=topic)
        listener = driver.listen(target)
        timeout = 10

        def send_main():
            # times out if topic was not None and used before
            reply = driver.send(target,
                                {},             # ctxt
                                {'ping': 1},    # arbitrary message payload
                                wait_for_reply=True,
                                timeout=timeout)
            log.debug("received reply=%r", reply)

        send_thread = threading.Thread(target=send_main)
        send_thread.start()

        # Serve one request on the listener side and reply to it.
        msg = listener.poll()
        log.debug("replying to msg=%r", msg)
        msg.reply({'pong': 1})

        send_thread.join()


if __name__ == '__main__':
    FORMAT = ('%(asctime)-15s %(process)5d %(thread)5d '
              '%(filename)s %(funcName)s %(message)s')
    logging.basicConfig(format=FORMAT, level=logging.DEBUG)
    OsloTest().test()
---- ---- ---- ---- ----
[Impact]
* This patch, along with those from LP #1400268 and LP #1408370, fixes rabbitmq reconnects.
* We are backporting this to Icehouse since oslo.messaging 1.3.0
fails to reconnect to Rabbit properly, particularly in nova-compute.
* This patch, along with its dependencies mentioned above, ensures that
multiple reconnect attempts happen by having connections time out and retry (see the kombu sketch after this list).
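For context on the bullets above: the backported patch is named 0006-Declare-DirectPublisher-exchanges-with-passive-True.patch. In general, passive=True turns an exchange declaration into a pure existence check: the broker raises a 404 channel error if the exchange is missing instead of silently creating it. Below is a minimal kombu sketch of such a passive declare; the broker URL and exchange name are placeholders, and this is only an illustration, not the actual oslo.messaging change.

# Minimal kombu sketch of a passive exchange declaration (illustrative only).
import kombu

connection = kombu.Connection('amqp://rabbit-mq-user@1.2.3.4//')  # placeholder URL
channel = connection.channel()
try:
    reply_exchange = kombu.Exchange(
        'reply_29c6579c5de24c00b0b0e55579b8e047',  # hypothetical reply exchange name
        type='direct',
        passive=True,  # existence check only: never creates the exchange
    )
    reply_exchange(channel).declare()
    print('reply exchange still exists, safe to publish the reply')
except Exception as exc:
    # A passive declare of a missing exchange is answered with a
    # 404 NOT_FOUND channel error.
    print('reply exchange is gone: %s' % exc)
finally:
    connection.release()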
[Test Case]
* Start a service that uses oslo.messaging with rabbitmq e.g. nova-compute
* Stop rabbitmq while tail-F /var/log/
* Observe that nova-compute amqp times out and it is trying to reconnect
* Restart rabbitmq
* Observe that rabbitmq connection has re-established
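As an alternative to watching the nova-compute log, a small standalone probe can make the reconnect behaviour visible from the client side. This is a hedged sketch against the oslo.messaging 1.3.x API: the broker URL and topic are placeholders, and it assumes an RPC server with a ping() endpoint is serving the same topic.

# Hedged reconnect probe (not part of the original report).
import logging
import time

from oslo.config import cfg
from oslo import messaging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

transport = messaging.get_transport(cfg.CONF, 'rabbit://rabbit-mq-user@1.2.3.4/')
target = messaging.Target(topic='reconnect_probe')  # hypothetical topic
client = messaging.RPCClient(transport, target, timeout=10)

while True:
    try:
        client.call({}, 'ping')
        log.info('rpc call succeeded: connection (re)established')
    except messaging.MessagingTimeout:
        log.warning('rpc call timed out: broker down or not yet reconnected')
    time.sleep(5)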
[Regression Potential]
* None. I have tested in my local cloud environment and it appears to be
reliable.
Related branches
- Chuck Short (community): Needs Fixing
- Corey Bryant: Pending (review requested)

Diff: 196 lines (+176/-0), 3 files modified:
- debian/changelog (+9/-0)
- debian/patches/0006-Declare-DirectPublisher-exchanges-with-passive-True.patch (+166/-0)
- debian/patches/series (+1/-0)
no longer affects: | oslo |
Changed in oslo.messaging: | |
status: | New → Confirmed |
Changed in oslo.messaging: | |
assignee: | nobody → Noel Burton-Krahn (noel-burton-krahn) |
status: | Confirmed → In Progress |
Changed in oslo.messaging: | |
status: | In Progress → Incomplete |
status: | Incomplete → In Progress |
Changed in oslo.messaging: | |
assignee: | Andrey Pavlov (apavlov-e) → nobody |
Changed in oslo.messaging: | |
status: | Incomplete → Confirmed |
Changed in oslo.messaging: | |
assignee: | Vish Ishaya (vishvananda) → Mehdi Abaakouk (sileht) |
Changed in oslo.messaging: | |
importance: | Undecided → High |
milestone: | none → next-kilo |
Changed in oslo.messaging: | |
milestone: | 1.6.0 → next-kilo |
Changed in oslo.messaging: | |
status: | Fix Committed → Fix Released |
Changed in oslo.messaging (Ubuntu Vivid): | |
status: | New → Fix Released |
Changed in oslo.messaging (Ubuntu Utopic): | |
status: | New → Fix Released |
Changed in oslo.messaging (Ubuntu Trusty): | |
status: | New → Confirmed |
importance: | Undecided → High |
description: | updated |
Changed in oslo.messaging (Ubuntu Trusty): | |
status: | Confirmed → In Progress |
tags: | added: icehouse-backport-potential |
Changed in oslo.messaging (Ubuntu Trusty): | |
assignee: | nobody → James Page (james-page) |
Changed in oslo.messaging (Ubuntu Utopic): | |
status: | Fix Released → New |
importance: | Undecided → High |
Changed in cloud-archive: | |
status: | New → Fix Released |
The code above was missing a call to acknowledge(), so it kept reading stale messages. However, even with acknowledge(), it still reproduces the original bug (the corrected listener fragment is shown after the steps below). Here's how I reproduce the original problem: I verify the cluster is working, then kill rabbit, then try again.
1. start cluster, create vms, migrate: ok
2. kill and restart rabbit
3. migrate vm: timeout
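For reference, here is how the listener side of the repro script looks with the missing acknowledge() call added. This is a fragment of that script, not standalone code: msg is the IncomingMessage returned by listener.poll() from the 1.3.x driver API.

# Listener side of the repro script with the previously missing
# acknowledge() call added.
msg = listener.poll()
msg.acknowledge()       # ack so the broker does not redeliver the request
msg.reply({'pong': 1})  # then send the reply the caller is blocked on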
Here's a trace from https://gist.github.com/noelbk/619426fa88c3bdd0534c taken after rabbit was restarted. This pattern repeats a few times during a migration:
19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047
I'm instrumenting the rpc calls now to see if they all actually do retry and complete after the timeout errors. I'm trying to get a trace of all the rpc calls to see if they're being acknowledged but not replied to in time.
While digging through oslo.messaging, I noticed that in amqpdriver.py the incoming queues in ReplyWaiter and AMQPListener are plain Python lists, not thread-safe Queue.Queue objects. ReplyWaiter does acquire a lock, but I'm not sure the plain lists are always accessed safely (see the sketch below). Not sure if this is causing my issue.
Still digging...
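On the thread-safety question above: a bare list is only safe if every reader and writer consistently takes the same lock, whereas Queue.Queue serializes access internally and lets a consumer block with a timeout. A small Python 2 sketch of the two patterns follows; it is illustrative only, not oslo.messaging code.

# Illustration of the two patterns discussed above (Python 2).
import threading
import Queue

# Pattern 1: a plain list guarded by an explicit lock.  Safe only as long as
# *every* access happens under the lock; consumers have to poll and retry.
incoming = []
incoming_lock = threading.Lock()

def put_locked(msg):
    with incoming_lock:
        incoming.append(msg)

def get_locked():
    with incoming_lock:
        return incoming.pop(0) if incoming else None

# Pattern 2: Queue.Queue is internally synchronized and supports a blocking
# get with a timeout, so no external lock or polling loop is needed.
replies = Queue.Queue()
replies.put({'msg_id': 'abc'})
print(replies.get(timeout=1))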