Race in LockFile.py resulting in extraneous log entries.

Bug #558153 reported by grnbrg
Affects: GNU Mailman
Status: New
Importance: Low
Assigned to: Unassigned

Bug Description

There is a race condition (that does not affect correct
operation of Mailman) in .../Mailman/LockFile.py. This
race results in entries to the "locks" log file
(usually in pairs) that are unnecessary and confusing.

If there is a process holding a lock for a file, and a
process waiting for that lock to be freed, the
following sequence can occur:

1) The waiting process executes os.link(). Since the
running process still has the lock, this fails with EEXIST.
2) The running process releases the lock, and removes
the lock file.
3) The waiting process proceeds with its checks, and
checks self.__linkcount() which returns -1 due to an
ENOENT error. This throws an error into the "locks"
log file.
4) The waiting process continues processing, checks
its timeout, and then checks to see if the lock
lifetime has expired. The lifetime check is based on a
call to self.__releasetime(), which returns -1, due to
an ENOENT error. This results in the lockfile being
declared stale, and self.__break() is called. This
then throws a second error into the "locks" log file.

The attached patch checks for these ENOENT errors,
immediately restarting the loop when they are detected.
It does not affect operation if the lock file exists
and is held by another process.
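
For illustration only, a minimal sketch of the kind of check
the patch introduces (hypothetical helper name; the real change
is in the attached LockFile.py.patch):

import errno
import os

def probe_lock(lockfile):
    """Return (linkcount, mtime) for lockfile, or None if it vanished.

    A None result corresponds to the -1 returns described above:
    the holder released and removed the lock between the waiter's
    failed os.link() and this stat(), so the waiter should simply
    restart its acquisition loop instead of logging "unexpected
    linkcount" and then breaking a supposedly stale lock.
    """
    try:
        st = os.stat(lockfile)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        return None
    return st.st_nlink, st.st_mtime

A waiter that gets EEXIST from os.link() would call something
like this and restart the loop on None, rather than running its
stale-lock checks against a file that no longer exists.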

Brian Greenberg.

grnbrg (grnbrg) wrote :

The file LockFile.py.patch was added: Patch to LockFile.py to fix race.

shigeno (shigeno) wrote :

How about changing self.__releasetime() as follows?

@@ -448,7 +448,7 @@
             return os.stat(self.__lockfile)[ST_MTIME]
         except OSError, e:
             if e.errno <> errno.ENOENT: raise
-            return -1
+            return time.time()

     def __linkcount(self):
         try:

grnbrg (grnbrg) wrote :

This would fix this bug, as long as CLOCK_SLOP is large enough.

But I don't think it's the Right Thing, as you are returning
a result that says that the lockfile has a lifetime, when it
in fact simply no longer exists. For a function whose
purpose is to report the mtime of the master lockfile, -1 is
the correct response if the file doesn't exist. The code
that calls that function should check for and interpret that
condition.
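
To make the disagreement concrete, a simplified sketch (not the
actual LockFile.py code) of how a caller's lifetime check can
interpret the two conventions; the CLOCK_SLOP constant mirrors
the one mentioned above, though its real value lives in
LockFile.py:

import time

CLOCK_SLOP = 10  # allowance for clock skew; see LockFile.py for the real value

def lock_is_stale(releasetime, now=None):
    """Decide whether a held lock's lifetime has expired.

    releasetime is the mtime of the master lock file, or -1 if the
    file no longer exists.  With the -1 convention the caller can
    tell "the lock vanished, just retry" apart from "the lock is
    stale, break it"; returning time.time() instead merely hides
    the missing file for CLOCK_SLOP seconds.
    """
    if now is None:
        now = time.time()
    if releasetime == -1:
        return False  # no master lock file at all -- nothing to break
    return now > releasetime + CLOCK_SLOP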

tkikuchi (tkikuchi-users) wrote :

Will you be more specific about how this race occurs? My
installation on FreeBSD doesn't seem to show this symptom
in the locks log file.
I tested with two terminals, invoking python -i bin/withlist
testlist and trying to lock with m.Lock() in both terminals.
One terminal is forced to wait until I type m.Unlock() in the
other. I then looked into the logs dir but nothing was there.

bwarsaw (bwarsaw) wrote :

I haven't had time to look at this in detail, but I'll just
include a general warning that LockFile.py is /very/
fragile. It's also critical to the proper operation of
Mailman. So any changes, even seemingly innocent ones,
must be very thoroughly tested. My inclination would be to
reject any change that wasn't absolutely necessary, but I
won't veto this change if it is well-tested and fixes a
verified bug (tho' it sounds like tkikuchi can't reproduce it).

grnbrg (grnbrg) wrote :

I'm running on a Solaris box, which may or may not have any
impact on the issue.

I'm not sure how to force reproduction -- races are hard to
intentionally duplicate. I first noticed the problem under a
heavy load due to an admin "oops" -- 30,000+ news articles
being gatewayed into a test list. :) Once the oops was
fixed, the error logs continued to occasionally appear, and
rather than assume that everything was ok, I tracked down
the source of the problem. I did this by desk-checking the
source, and identified the race by a bit of guesswork. Once
these changes were made, the problem stopped.

The race occurs as follows:

1) A qrunner runs LockFile.lock() to request a particular lock.
2) At LockFile.py:256, this qrunner attempts to link its
temp lock file to the global lockfile. This link fails with
an EEXIST error, because another process currently holds the
lock. The assumption is made at this point that the global
lockfile exists. This (false) assumption continues through
to the end of the loop at LockFile.py:312. In fact, the
global lockfile may cease to exist (because the qrunner that
has it calls unlock()) at any time.
3) The process holding the lock calls unlock().
4) The waiting process (which just got an EEXIST) goes
through its various sanity checks, which include checking
the linkcount and mtime of the global lockfile, which now
does not exist. Since it is assumed to exist, the -1
results returned make no sense, and are unconditionally logged.
5) At LockFile.py:303, the (incorrectly assumed) existing
lock is broken, and control goes back to the top of the loop
(after a quick __sleep()).
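
The window can be demonstrated outside Mailman with two
cooperating processes. This standalone sketch (hypothetical file
names, timing forced with sleeps) shows the EEXIST from os.link()
followed by ENOENT on the same lock file:

import errno
import os
import time
from multiprocessing import Process

LOCKFILE = 'global.lck'   # stands in for the master lock file
TMPFILE = 'waiter.tmp'    # stands in for the waiter's temp lock file

def holder():
    # The process holding the lock; it releases and removes the
    # lock file partway through (step 3 in the walkthrough).
    open(LOCKFILE, 'w').close()
    time.sleep(0.5)
    os.unlink(LOCKFILE)

def waiter():
    time.sleep(0.1)                     # let the holder take the lock first
    open(TMPFILE, 'w').close()
    try:
        os.link(TMPFILE, LOCKFILE)      # step 2: fails while the lock is held
    except OSError as e:
        assert e.errno == errno.EEXIST
    time.sleep(1.0)                     # the holder releases in the meantime
    try:
        os.stat(LOCKFILE)               # step 4: the sanity checks
    except OSError as e:
        if e.errno == errno.ENOENT:
            print('lock file vanished between link() and stat()')
    os.unlink(TMPFILE)

if __name__ == '__main__':
    a = Process(target=holder)
    b = Process(target=waiter)
    a.start(); b.start()
    a.join(); b.join()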

As far as impact, the patch should have minimal side
effects. I pulled the function calls out of the if
statements, so there is no change in the number of calls.
Further, all the new code does is to check for the "-1"
return that indicates that the global lockfile no longer
exists, and immediately restarts the loop if that is the
case. It might not be a bad idea to add a "self.__sleep()"
before the continue, but I didn't think it was needed.

Hope that adds enough detail to trace what I think is
happening, and verify that it won't break anything.

FWIW, I have not had any problems, nor any errors logged
since applying this locally.

ppsys (ppsys-users) wrote :

The comments in the LockFile.py header code make the point that the
entire mechanism is there to deal with NFS robustness/compatibility
issues. The extraneous logging that concerns grnbrg is appearing on
Solaris (version and build unspecified), and the attempts to
reproduce it, which have thus far failed, appear to be on FreeBSD.
I suspect that the majority of MM installs are on various versions
and builds of Linux. These OSen will all have substantially
different NFS implementations and major differences in their
kernels that probably affect NFS implementation and operation.
Because of these differences I have already encountered sufficient
problems with the interworking of Linux NFS clients and Solaris NFS
servers to convince me that changing LockFile.py without sufficient
testing, in order to clear an essentially cosmetic problem on one
host OS, is an unnecessary risk to stability across the board. Just
testing on one host is not good enough, and reading code even less
so; it is the dynamics of operation in live situations that are the
issue. A test programme needs to at least nod in the direction of
heterogeneous NFS client/host configurations of various types being
in use for validation and verification of any changes.

tkikuchi (tkikuchi-users) wrote :

OK. I found the log output in my Solaris installation. But I
can't reproduce the said log pair using bin/withlist. It looks
like the occurrence of this log output is unpredictable and
needs more checking before applying the patch.

I found:
 - [unexpected linkcount: -1 / lifetime has expired, breaking] pair in Solaris
 - [lifetime has expired, breaking] only in BSD/OS
 - no locks log in FreeBSD

So, the behaviour is highly OS dependent I guess.

grnbrg (grnbrg) wrote :

I'm running Solaris 8, with the locks and qfiles on local
disk. mm 2.1.5 and python 2.3.3. There are a little over
1000 lists, with traffic of 300-500 posts per day, on average.

Given that it's cosmetic, there is no rush.

I'm open to testing suggestions...
