I think there are two inter-related issues to tease apart.
1) We added soft-shutdown to Twisted, which was intended for things like HA: when you first issue shutdown, it stops accepting new connections and waits for existing ones to close. That lets us transition to another codehosting service without killing everything immediately.
I think this is still roughly reasonable, but we probably should have a hard-stop, as it seems some people have keep-alive connections. (Probably something like master ssh connections, so they can avoid future ssh handshakes.)
2) Twisted has a failure mode during soft shutdown: sending SIGINT while it thinks it is stopping causes it to issue a traceback and not actually do anything (something about 'you can't stop a reactor that isn't running').
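To make the two points concrete, here is a rough sketch of the behaviour I have in mind, in plain Python rather than real Twisted code. The class and method names (SoftShutdown, request_shutdown, tick) are made up for illustration, and treating a repeated shutdown signal as a harmless no-op is my assumption about what we would want, not what the service does today.

```python
class SoftShutdown:
    """Hypothetical helper: soft shutdown with a hard-stop deadline."""

    def __init__(self, hard_stop_after):
        self.hard_stop_after = hard_stop_after  # seconds, e.g. 600 for 10 min
        self.connections = set()
        self.deadline = None   # set once shutdown has been requested
        self.stopped = False

    def accept(self, conn):
        """Return True if a new connection may be served."""
        if self.deadline is not None:
            return False  # soft shutdown: refuse new connections
        self.connections.add(conn)
        return True

    def connection_closed(self, conn):
        self.connections.discard(conn)

    def request_shutdown(self, now):
        """Handle a shutdown signal; a repeated signal changes nothing."""
        if self.deadline is None:
            self.deadline = now + self.hard_stop_after

    def tick(self, now):
        """Return the connections to kill now; mark stopped exactly once."""
        if self.deadline is None or self.stopped:
            return []
        to_kill = []
        if now >= self.deadline:
            # Hard stop: anything still open (e.g. keep-alive ssh) is dropped.
            to_kill = sorted(self.connections)
            self.connections.clear()
        if not self.connections:
            self.stopped = True
        return to_kill
```

The point of the stopped flag is that the "stop" step can only ever happen once, so a second signal during the drain cannot trip over an already-stopping process.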
I don't know what a reasonable time is for (1). It probably depends mostly on what you need for deployments. Once you have multiple services, you could do a slow no-downtime deploy, and it seems feasible to wait a fairly long time (30min) as long as it eventually progresses and doesn't need to be babysat by a LOSA. Barring that, something like 10min would probably be a feasible next step. (Soft-shutdown process 1 so it takes no new connections, serve all new connections from process 2, kill process 1's remaining connections after 10 min, restart process 1, start soft-shutdown on process 2, etc.)
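For a feel of what those timeouts mean for a whole deploy, here is a small worst-case calculation. It is only a sketch: the function name is invented, it assumes every process needs its full drain timeout (no connection closes early), and it assumes the restart itself takes no time unless you pass restart_time.

```python
def rolling_restart_schedule(processes, drain_timeout, restart_time=0):
    """Return (events, total_minutes) for a one-at-a-time restart.

    Each event is (minute, process, action). Worst case only: every
    process is assumed to wait out the whole drain timeout.
    """
    events = []
    t = 0
    for name in processes:
        events.append((t, name, "soft-shutdown, no new connections"))
        t += drain_timeout
        events.append((t, name, "hard-stop, kill remaining connections"))
        events.append((t, name, "restart"))
        t += restart_time
    return events, t


events, total = rolling_restart_schedule(["process 1", "process 2"], 10)
for minute, name, action in events:
    print(f"{minute:3d} min  {name}: {action}")
print("worst case total:", total, "min")
```

With two processes, a 10min drain gives a 20min worst case, and a 30min drain gives an hour, which is why it matters whether the deploy can run unattended.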
Certainly the SSH connections are longer-lived (and harder to hand off to another process) than our corresponding HTTP web service requests, so I don't think the system can act exactly like the app-servers during no-downtime updates.