There seems to be a bug in Twisted 10.2. If something bad happens, the process can get into a state where it hasn't stopped, but both SIGTERM and SIGINT fail to shut it down because Twisted "Can't stop reactor that isn't running".
I wasn't very worried about it, but it just happened in production, so I'm escalating the severity. The traceback looks something like this:
2011-02-11 11:21:17+0000 [-] Received SIGTERM, shutting down.
2011-02-11 11:21:17+0000 [-] Unhandled Error
Traceback (most recent call last):
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/application/app.py", line 390, in startReactor
    self.config, oldstdout, oldstderr, self.profiler, reactor)
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/application/app.py", line 311, in runReactorWithLogging
    reactor.run()
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 1158, in run
    self.mainLoop()
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 1167, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 762, in runUntilCurrent
    f(*a, **kw)
  File "/srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/Twisted-10.2.0_4395fix_1-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 570, in stop
    "Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
This is causing lots of OOPS reports, such as OOPS-1868SMPSSH1000.
<jml> just found out about this issue affecting us: https://bugs.launchpad.net/launchpad/+bug/717205
<exarkun> Unless there's more details, I don't think that's new in 10.2
<exarkun> You could always have a before shutdown trigger that returns a Deferred that doesn't fire as soon as you'd like
<exarkun> And re-sending a shutdown signal while waiting for that would always do something wacky
<exarkun> But! Certainly it would be nice to do something better.
In my experience with this error, that's exactly the cause.
So the obvious question is: what would “something better” be? Some possibilities:
* a subsequent SIGTERM/SIGINT forces shutdown to continue without waiting for the unfired Deferred(s).
* a subsequent SIGTERM/SIGINT just logs a simple warning “SIGFOO received but shutdown already in progress”
* a subsequent SIGTERM/SIGINT logs a warning and some details about what it is waiting on (ideally identifying the trigger(s) involved and even e.g. the outstanding connections or whatever is involved)
* provide an API on the reactor that controls whether a shutdown signal received during a blocked shutdown warns, forces shutdown to proceed, or whatever, so that service authors can choose which behaviour they want.
I'm sure there are others. I'm not sure which is best for Launchpad's use case(s), or best in general.
A workaround for Launchpad's cases may be to explicitly override SIGTERM/INT from the before-shutdown trigger we register, but it seems fairly clear to me that Twisted should offer better facilities here. The relevant upstream bug appears to be <http://twistedmatrix.com/trac/ticket/4406>.
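A rough sketch of what that workaround might look like, using the "just log a warning" behaviour for repeated signals. This is an untested idea rather than anything Launchpad or Twisted ships: `install_shutdown_guard`, `warn_shutdown_in_progress` and the `cleanup` callable are hypothetical names, and it assumes the reactor runs in the main thread (which `signal.signal` requires) and that the trigger is registered through the reactor's existing `addSystemEventTrigger`:

```python
import logging
import signal

log = logging.getLogger("shutdown-guard")


def warn_shutdown_in_progress(signum, frame):
    # Replacement handler: log instead of calling reactor.stop() a second time.
    log.warning("Signal %d received but shutdown already in progress", signum)


def install_shutdown_guard(reactor, cleanup):
    """Register `cleanup` as a before-shutdown trigger that first swaps the
    SIGTERM/SIGINT handlers, so a repeated signal cannot reach reactor.stop()
    while the Deferred returned by `cleanup` is still pending.
    """
    def trigger():
        for signum in (signal.SIGTERM, signal.SIGINT):
            signal.signal(signum, warn_shutdown_in_progress)
        # May return a Deferred; the reactor waits on it before shutting down.
        return cleanup()

    return reactor.addSystemEventTrigger("before", "shutdown", trigger)
```

The handlers are only swapped once shutdown has actually begun, so the first SIGTERM/SIGINT still goes through Twisted's normal handling. The obvious cost is that a stuck trigger can then only be escaped with SIGKILL, which is part of why a proper facility in Twisted would be preferable.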