qmgr process loads the system when using rate_* in custom transports
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
postfix (Ubuntu) | Fix Released | Medium | Unassigned |
Bug Description
Last month I had a load average issue on a Postfix mail server (it runs only the Postfix service). Suddenly the load average started to rise, and the qmgr process appeared at the top of "top", taking 20-30% of CPU.
top - 18:19:54 up 7 days, 2:03, 2 users, load average: 4.94, 3.96, 4.02
Tasks: 144 total, 6 running, 138 sleeping, 0 stopped, 0 zombie
Cpu(s): 48.3%us, 50.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 1035280k total, 999964k used, 35316k free, 149072k buffers
Swap: 750696k total, 88k used, 750608k free, 599308k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23665 postfix 20 0 5880 2628 1792 S 20.3 0.3 68:11.18 qmgr
23662 root 20 0 5392 1732 1400 R 6.0 0.2 20:49.46 master
Network traffic was low and email throughput was normal.
The queue held only 73 emails when the problem happened (as now, all of them deferred).
Running "postfix stop" followed by "postfix start" solved the problem.
I reported the bug to the Postfix Users mailing list, and Postfix's author (Wietse Venema) found that it was a bug and posted a patch to the list.
Some snippets from the list:
-------
VICTOR DUCHOVNI:
Please wait for an updated patch, we believe we have identified the
cause and reproduced the symptoms (in that order). I have a candidate
patch, but I expect Wietse will send an updated more polished version
in the not too distant future.
The issue found applies only to "rate-limited" transports, if you are
not using such transports, you don't need the patch. The patch ensures
that work done at the completion of a delivery with a "normal" transport
is correctly split between "before suspend" and "after resume".
The original 2.5.x code is correct for "oqmgr", but not for "qmgr"
(aka "nqmgr"), which requires additional internal state adjustments
when destinations are blocked and unblocked.
-------
WIETSE VENEMA:
To apply this patch, cd into the Postfix-2.5.* top-level source
directory and execute:
$ patch < thismessage
We were able to reproduce the scheduler looping problem, and it
does not recur with the patched version.
Wietse
-------
I applied the patch and the problem has not happened again, but I need the patch to be integrated into Ubuntu's postfix deb packages so that I can still benefit from future security updates.
The patch was posted to the mailing list on:
Date: Thu, 5 Mar 2009 17:41:51 -0500 (EST)
Thanks a lot.
I'm attaching the patch posted by Wietse Venema at the mailing list.
I don't know why, but I got 5 rejects when applying the patch. The replacement code was correct, but some of the target lines were 2 or 3 lines below the line numbers given in the patch.
I fixed the rejects by hand using the .rej files, and the patched build is working. Let me know if you need me to extract my custom patch as a diff between my current sources and the standard Ubuntu postfix sources.
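If such a diff is wanted, one way to produce it might look like the sketch below. It assumes two source trees sit side by side, one pristine and one with the fix applied; the function name `make_custom_patch` and all directory and file names are hypothetical placeholders, not paths from my system.

```shell
# Hypothetical helper: write a unified diff between a pristine source tree
# and a locally patched one into a patch file.
make_custom_patch() {
    orig_dir=$1      # untouched source tree (placeholder name)
    patched_dir=$2   # tree with the fix applied (placeholder name)
    out_file=$3      # where the resulting patch is written
    status=0
    # -u: unified format, -r: recurse into subdirectories,
    # -N: treat files missing on one side as empty.
    diff -urN "$orig_dir" "$patched_dir" > "$out_file" || status=$?
    # diff exits 1 when the trees differ, which is the expected case here;
    # only an exit status of 2 or more signals a real error.
    [ "$status" -le 1 ]
}
```

It could then be called as `make_custom_patch postfix.orig postfix.patched qmgr-custom.patch`, again with placeholder names. Diffing whole trees this way would also capture the line offsets that the hand-fixed rejects introduced.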