ZeroMQ cast timeout ineffective

Bug #1193439 reported by Erica Windisch
This bug affects 3 people
Affects: oslo.messaging
Status: Invalid
Importance: Undecided
Assigned to: Li Ma
Milestone: (none)

Bug Description

The ZeroMQ driver is supposed to time out on cast() and stop attempting to send a message once the cast timeout expires. However, the ZeroMQ library call send() is relatively non-blocking: it can block, but only while putting a message into the queue. It does not block until the message is delivered.
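To illustrate the queueing behaviour described above, here is a minimal sketch using pyzmq. The PUSH socket type and the helper names (unused_tcp_port, timed_send_to_dead_peer) are assumptions made for the illustration and are not taken from the driver. The socket is closed with linger=0 only so that the demo itself can exit.

```python
import socket
import time

import zmq


def unused_tcp_port():
    """Ask the OS for a free port, then release it so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def timed_send_to_dead_peer(payload=b"ping"):
    """Return how long send() takes when no peer is reachable (hypothetical helper)."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)  # assumption: a PUSH socket, for illustration only
    sock.connect("tcp://127.0.0.1:%d" % unused_tcp_port())

    start = time.monotonic()
    sock.send(payload)  # returns once the message is queued, not once it is delivered
    elapsed = time.monotonic() - start

    sock.close(linger=0)  # discard the undeliverable message so the demo can exit
    ctx.term()
    return elapsed
```

On a typical setup the returned time should be very small even though nothing ever receives the message, which is why a timeout wrapped around send() alone does not bound delivery attempts.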

Because of this, socket.close() is always called immediately after send(). Closing right away does not discard the message, because linger=-1 is in effect when the socket is closed.

Because linger is set to -1 by default and is not overridden, ZeroMQ does not simply stop attempting to send messages after we close the socket and release the reference from Python. Instead, while we garbage collect on Python's side, the C side keeps the message alive.

The present behavior allows a socket to close once it has successfully sent its message. A failed send leaves a hanging file descriptor and is retried indefinitely.

The solution is not to use Eventlet's timeout, but to use the ZeroMQ linger argument correctly. This also has the positive benefit of removing some reliance on Eventlet itself.
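A rough sketch of the linger-based approach, using pyzmq. This is not the actual patch: the function names (cast_with_linger, term_time_after_cast), the PUSH socket type, and the 200 ms value are all assumptions chosen for illustration.

```python
import socket
import time

import zmq


def cast_with_linger(ctx, address, payload, timeout_ms):
    """Fire-and-forget send whose delivery attempts are bounded by LINGER.

    Hypothetical helper: after close(), ZeroMQ keeps trying to deliver the
    queued message for at most timeout_ms milliseconds, then drops it.
    """
    sock = ctx.socket(zmq.PUSH)  # assumption: socket type chosen for the sketch
    sock.setsockopt(zmq.LINGER, timeout_ms)  # override the linger period
    sock.connect(address)
    sock.send(payload)  # only queues the message
    sock.close()  # the linger period starts counting here


def term_time_after_cast(timeout_ms=200):
    """Measure how long cast plus context termination takes with no peer."""
    # Reserve a free port and release it, so the address has no listener.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        port = s.getsockname()[1]

    ctx = zmq.Context()
    start = time.monotonic()
    cast_with_linger(ctx, "tcp://127.0.0.1:%d" % port, b"hello", timeout_ms)
    ctx.term()  # blocks until pending messages are sent or linger expires
    return time.monotonic() - start
```

With a finite linger value, term_time_after_cast() should return after roughly the linger period instead of blocking until the peer comes back, and no Eventlet timeout is involved.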

Tags: zmq
Changed in oslo:
assignee: nobody → Eric Windisch (ewindisch)
status: New → In Progress
Mark McLoughlin (markmc) wrote :

You say "Someone found this in testing and I'm having them confirm it fixes their problem."

Could you describe the exact symptoms seen by the user?

Changed in oslo:
status: In Progress → Incomplete
Changed in oslo:
status: Incomplete → In Progress
Erica Windisch (ewindisch) wrote :

Mark:

Messages that cannot be delivered are retried until delivery succeeds or the service is restarted. The cast timeout is not respected. This causes a file-descriptor leak and unnecessary network traffic. This was discovered by a user who had an offline machine that was being sent many messages by many machines, flooding their network.

affects: oslo-incubator → oslo.messaging
Li Ma (nick-ma-z) wrote :

Hi all, this problem is still there. The linger option doesn't take effect when I set it to a value greater than 0.

tags: added: zmq
Li Ma (nick-ma-z)
Changed in oslo.messaging:
assignee: Eric Windisch (ewindisch) → Li Ma (nick-ma-z)
Changed in oslo.messaging:
status: In Progress → Invalid