From looking at the source code and the logs, I believe this is what happens:
- snapd snap is updated
- snapd requests a restart
- we reach the code in daemon.Stop() where we attempt a graceful shutdown of HTTP connections
- the shutdown code hits the default timeout (25s, which matches the log) while there are still active connections
- the listen socket is closed as the first step, so no new connections are possible
- the error of the tomb is set and propagated up the call stack
- snapd exits with an error, which triggers the failover handling, which in turn reverts the new revision
I suspect there may be a slow client hanging on the snapd API socket that keeps the connection alive. We should consider swallowing the error like we do for system restarts, or falling back to a forceful http.Server.Close() when the timeout is hit.