During the 11.06 rollout, codehosting failed to stop in a reasonable time and had to be manually killed.
[Abridged]
<mbarnett> crowberry doesn't want to let go
<wgrant> It never does.
<wgrant> kill -9
<wgrant> We've had to do that all year.
<lifeless> what on crowberry
<mbarnett> kk
<mbarnett> moving on now
<mthaddon> do we need to kill the branch-rewrite.py as well?
<lifeless> if it has DB access, hell yes
<mthaddon> ok, done
<lifeless> mbarnett: what on crowberry needed to be killed ?
<mwhudson> it does
<mthaddon> lifeless: codehosting itself
<mbarnett> codehost service itself
<lifeless> thanks
<wgrant> bzr-sftp, as always.
<lifeless> wgrant: is there a bug ?
<wgrant> Yes.
<mthaddon> lifeless: https://pastebin.canonical.com/48331/
<lifeless> we shouldn't need to shut codehosting down
<lifeless> it has no db access
<mwhudson> +1 lifeless
<elmo> host @lp_prod.lst all 91.189.90.11/32 ident map=crowberry
<lifeless> elmo: any chance that that is old ?
<mwhudson> elmo: sure, things on the box access the db -- but not bzr-sftp
<lifeless> elmo: or for the bzr-rewrite map ?
<elmo> lifeless: https://pastebin.canonical.com/48332/
<mthaddon> the fact remains we'd have to restart it with new code and if it's failing to stop that's a problem either way
<lifeless> mthaddon: its probably doing the graceful shutdown its designed to do
<mbarnett> pending warning removed.
<wgrant> lifeless: Not quite.
<lifeless> mthaddon: which with a HA config is desirable
<wgrant> lifeless: It hangs for even 10 minutes. There are always a few connections left. And twistd is OOPSing hundreds of times.
<lifeless> elmo: thanks, more fodder to get taken away from the DB
<lifeless> wgrant: ack
I think there are two inter-related issues to tease apart.
1) We added soft-shutdown to Twisted, intended for things like HA: when you first issue a shutdown, the service stops accepting new connections and waits for existing ones to close, allowing us to transition to another codehosting service without killing everything immediately.
I think this is still roughly reasonable, but we should probably also have a hard-stop deadline, as it seems some people hold keep-alive connections (probably something like SSH master connections, kept open to avoid future SSH handshakes). See the sketch after point 2.
2) Twisted has a failure mode during soft shutdown: sending SIGINT while it thinks it is stopping causes it to emit a traceback and not actually do anything (something about "you can't stop a reactor that isn't running").
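To make both points concrete, here is a minimal sketch, assuming a plain Twisted TCP service rather than the real bzr-sftp stack; EchoProtocol, GracefulFactory, the port number, and the 600-second deadline are all invented for illustration. It stops listening on the first stop request, waits for existing connections to drain, cuts whatever is left once the deadline passes, and guards the stop request so a repeated SIGINT doesn't just traceback.

    # Minimal sketch only: a plain echo service standing in for bzr-sftp.
    import signal

    from twisted.internet import defer, reactor
    from twisted.internet.error import ReactorNotRunning
    from twisted.internet.protocol import Factory, Protocol

    HARD_STOP_SECONDS = 600  # illustrative deadline, not a real setting


    class EchoProtocol(Protocol):
        def connectionMade(self):
            self.factory.connections.add(self)

        def connectionLost(self, reason):
            self.factory.connections.discard(self)
            self.factory.check_drained()

        def dataReceived(self, data):
            self.transport.write(data)


    class GracefulFactory(Factory):
        protocol = EchoProtocol

        def __init__(self):
            self.connections = set()
            self.drained = None

        def drain(self):
            """Return a Deferred that fires once the last connection closes."""
            self.drained = defer.Deferred()
            self.check_drained()
            return self.drained

        def check_drained(self):
            if self.drained is not None and not self.connections:
                d, self.drained = self.drained, None
                d.callback(None)


    factory = GracefulFactory()
    port = reactor.listenTCP(5022, factory)


    def soft_shutdown():
        # (1) Stop accepting new connections, then wait for existing ones
        # to close.  Returning a Deferred from a 'before shutdown' trigger
        # makes the reactor hold off finishing shutdown until it fires.
        d = defer.maybeDeferred(port.stopListening)

        def wait_with_deadline(_):
            drained = factory.drain()

            def hard_stop():
                # Keep-alive connections that never go away get cut here,
                # which fires the drain Deferred via connectionLost.
                for conn in list(factory.connections):
                    conn.transport.loseConnection()

            deadline = reactor.callLater(HARD_STOP_SECONDS, hard_stop)

            def cancel_deadline(result):
                if deadline.active():
                    deadline.cancel()
                return result

            return drained.addCallback(cancel_deadline)

        return d.addCallback(wait_with_deadline)


    reactor.addSystemEventTrigger('before', 'shutdown', soft_shutdown)


    def request_stop(signum=None, frame=None):
        # (2) Guard the stop request: once soft shutdown has begun, a
        # second reactor.stop() raises ReactorNotRunning ("can't stop a
        # reactor that isn't running") and achieves nothing, so swallow it
        # instead of letting it traceback.
        def stop():
            try:
                reactor.stop()
            except ReactorNotRunning:
                pass

        reactor.callFromThread(stop)


    # Install our own SIGINT handler so the guard above is what runs on ^C.
    signal.signal(signal.SIGINT, request_stop)
    reactor.run(installSignalHandlers=False)

The real change would presumably live in twistd or the codehosting .tac, but the shape is the same: the 'before shutdown' trigger does the drain, and the hard-stop DelayedCall bounds how long it can take.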
I don't know what a reasonable time is for (1). It probably depends most on what we need for deployments. Once we run multiple service processes, we could do a slow no-downtime deploy, and it seems feasible to wait a fairly long time (30 minutes or so), as long as the deploy eventually progressed on its own and didn't need to be babysat by a LOSA. Barring that, something like 10 minutes would be a feasible next step: soft-shutdown process 1 so it accepts no new connections, serve all new connections from process 2, kill process 1's remaining connections after 10 minutes, restart process 1, then start the soft shutdown of process 2, and so on.
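As a purely hypothetical sketch of that rotation (the Backend class and its methods are invented placeholders, not real Launchpad tooling):

    # Hypothetical sketch of the 10-minute rotation described above.
    import time

    DRAIN_TIMEOUT = 10 * 60  # the "10min" figure, in seconds


    class Backend(object):
        """Placeholder for one codehosting service process."""

        def soft_shutdown(self):
            """Stop accepting new connections (hypothetical)."""

        def has_connections(self):
            """Report whether any client connections remain (hypothetical)."""
            return False

        def kill_remaining(self):
            """Hard-stop whatever connections are still open (hypothetical)."""

        def start(self):
            """Bring the process back up on the new code (hypothetical)."""


    def rolling_restart(backends):
        for backend in backends:
            backend.soft_shutdown()
            deadline = time.time() + DRAIN_TIMEOUT
            # Meanwhile new connections are served by the other backends.
            while backend.has_connections() and time.time() < deadline:
                time.sleep(5)
            backend.kill_remaining()
            backend.start()

The point of the loop is that only one process is ever draining at a time, so the LOSA (or a script) never has to take the whole service down.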
Certainly the SSH connections are longer lived (and harder to hand off to another process) than our corresponding HTTP web service requests, so I don't think the system can act exactly like the app servers during no-downtime updates.