During the 11.06 rollout, codehosting failed to stop in a reasonable time and had to be manually killed.
[Abridged]
<mbarnett> crowberry doesn't want to let go
<wgrant> It never does.
<wgrant> kill -9
<wgrant> We've had to do that all year.
<lifeless> what on crowberry
<mbarnett> kk
<mbarnett> moving on now
<mthaddon> do we need to kill the branch-rewrite.py as well?
<lifeless> if it has DB access, hell yes
<mthaddon> ok, done
<lifeless> mbarnett: what on crowberry needed to be killed ?
<mwhudson> it does
<mthaddon> lifeless: codehosting itself
<mbarnett> codehost service itself
<lifeless> thanks
<wgrant> bzr-sftp, as always.
<lifeless> wgrant: is there a bug ?
<wgrant> Yes.
<mthaddon> lifeless: https://pastebin.canonical.com/48331/
<lifeless> we shouldn't need to shut codehosting down
<lifeless> it has no db access
<mwhudson> +1 lifeless
<elmo> host @lp_prod.lst all 91.189.90.11/32 ident map=crowberry
<lifeless> elmo: any chance that that is old ?
<mwhudson> elmo: sure, things on the box access the db -- but not bzr-sftp
<lifeless> elmo: or for the bzr-rewrite map ?
<elmo> lifeless: https://pastebin.canonical.com/48332/
<mthaddon> the fact remains we'd have to restart it with new code and if it's failing to stop that's a problem either way
<lifeless> mthaddon: its probably doing the graceful shutdown its designed to do
<mbarnett> pending warning removed.
<wgrant> lifeless: Not quite.
<lifeless> mthaddon: which with a HA config is desirable
<wgrant> lifeless: It hangs for even 10 minutes. There are always a few connections left. And twistd is OOPSing hundreds of times.
<lifeless> elmo: thanks, more fodder to get taken away from the DB
<lifeless> wgrant: ack
I think there are two inter-related issues to tease apart.
1) We added soft-shutdown to Twisted, intended for things like HA: when you first issue a shutdown, the service stops accepting new connections and waits for existing ones to close, allowing us to transition to another codehosting service without killing everything immediately.
I think this is still roughly reasonable, but we should probably also have a hard-stop deadline, as it seems some people hold keep-alive connections (probably something like SSH master connections, kept open to avoid future SSH handshakes). See the sketch after point 2.
2) Twisted has a failure mode during soft shutdown: sending SIGINT while it thinks it is stopping causes it to emit a traceback and not actually do anything (something about "you can't stop a reactor that isn't running").
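To make both points concrete, here is a minimal sketch, assuming a plain Twisted TCP service rather than the real bzr-sftp stack; EchoProtocol, GracefulFactory, the port number, and the 600-second deadline are all invented for illustration. It stops listening on the first stop request, waits for existing connections to drain, cuts whatever is left once the deadline passes, and guards the stop request so a repeated SIGINT doesn't just traceback.

    # Minimal sketch only: a plain echo service standing in for bzr-sftp.
    import signal

    from twisted.internet import defer, reactor
    from twisted.internet.error import ReactorNotRunning
    from twisted.internet.protocol import Factory, Protocol

    HARD_STOP_SECONDS = 600  # illustrative deadline, not a real setting


    class EchoProtocol(Protocol):
        def connectionMade(self):
            self.factory.connections.add(self)

        def connectionLost(self, reason):
            self.factory.connections.discard(self)
            self.factory.check_drained()

        def dataReceived(self, data):
            self.transport.write(data)


    class GracefulFactory(Factory):
        protocol = EchoProtocol

        def __init__(self):
            self.connections = set()
            self.drained = None

        def drain(self):
            """Return a Deferred that fires once the last connection closes."""
            self.drained = defer.Deferred()
            self.check_drained()
            return self.drained

        def check_drained(self):
            if self.drained is not None and not self.connections:
                d, self.drained = self.drained, None
                d.callback(None)


    factory = GracefulFactory()
    port = reactor.listenTCP(5022, factory)


    def soft_shutdown():
        # (1) Stop accepting new connections, then wait for existing ones
        # to close.  Returning a Deferred from a 'before shutdown' trigger
        # makes the reactor hold off finishing shutdown until it fires.
        d = defer.maybeDeferred(port.stopListening)

        def wait_with_deadline(_):
            drained = factory.drain()

            def hard_stop():
                # Keep-alive connections that never go away get cut here,
                # which fires the drain Deferred via connectionLost.
                for conn in list(factory.connections):
                    conn.transport.loseConnection()

            deadline = reactor.callLater(HARD_STOP_SECONDS, hard_stop)

            def cancel_deadline(result):
                if deadline.active():
                    deadline.cancel()
                return result

            return drained.addCallback(cancel_deadline)

        return d.addCallback(wait_with_deadline)


    reactor.addSystemEventTrigger('before', 'shutdown', soft_shutdown)


    def request_stop(signum=None, frame=None):
        # (2) Guard the stop request: once soft shutdown has begun, a
        # second reactor.stop() raises ReactorNotRunning ("can't stop a
        # reactor that isn't running") and achieves nothing, so swallow it
        # instead of letting it traceback.
        def stop():
            try:
                reactor.stop()
            except ReactorNotRunning:
                pass

        reactor.callFromThread(stop)


    # Install our own SIGINT handler so the guard above is what runs on ^C.
    signal.signal(signal.SIGINT, request_stop)
    reactor.run(installSignalHandlers=False)

The real change would presumably live in twistd or the codehosting .tac, but the shape is the same: the 'before shutdown' trigger does the drain, and the hard-stop DelayedCall bounds how long it can take.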
I don't know what a reasonable time is for (1). It probably depends most on what we need for deployments. Once we run multiple service processes, we could do a slow no-downtime deploy, and it seems feasible to wait a fairly long time (30 minutes or so), as long as the deploy eventually progressed on its own and didn't need to be babysat by a LOSA. Barring that, something like 10 minutes would be a feasible next step: soft-shutdown process 1 so it accepts no new connections, serve all new connections from process 2, kill process 1's remaining connections after 10 minutes, restart process 1, then start the soft shutdown of process 2, and so on.
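As a purely hypothetical sketch of that rotation (the Backend class and its methods are invented placeholders, not real Launchpad tooling):

    # Hypothetical sketch of the 10-minute rotation described above.
    import time

    DRAIN_TIMEOUT = 10 * 60  # the "10min" figure, in seconds


    class Backend(object):
        """Placeholder for one codehosting service process."""

        def soft_shutdown(self):
            """Stop accepting new connections (hypothetical)."""

        def has_connections(self):
            """Report whether any client connections remain (hypothetical)."""
            return False

        def kill_remaining(self):
            """Hard-stop whatever connections are still open (hypothetical)."""

        def start(self):
            """Bring the process back up on the new code (hypothetical)."""


    def rolling_restart(backends):
        for backend in backends:
            backend.soft_shutdown()
            deadline = time.time() + DRAIN_TIMEOUT
            # Meanwhile new connections are served by the other backends.
            while backend.has_connections() and time.time() < deadline:
                time.sleep(5)
            backend.kill_remaining()
            backend.start()

The point of the loop is that only one process is ever draining at a time, so the LOSA (or a script) never has to take the whole service down.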
Certainly the SSH connections are longer lived (and harder to hand off to another process) than our corresponding HTTP web service requests, so I don't think the system can act exactly like the app servers during no-downtime updates.