I spent some time on this today, and I think that upstart's handling of respawn + post-start is probably making jobs harder than they need to be.
In the mysql case, the post-start's mysqladmin ping loop needs to exit if the job goes into 'respawn', so that the process can in fact be respawned; otherwise upstart waits for the loop to finish. This also keeps the usual respawn limit from being hit, because we're respawning only once every 30 seconds. In fact, just the lag of forking and running mysqladmin ping is enough to prevent us from restarting 10 times in 5 seconds, which is the default limit.
So IMO, upstart should have a way to respawn independently of post-start. But that is a much bigger change and needs more thought. In the meantime, jobs with a post-start need to make sure the post-start reacts to the respawn status by exiting when it is detected, and need to take this into account in any respawn limits.
So, I'll be uploading a fix to precise's mysql-5.5 package which sets the respawn limit to 2 respawns in 5 seconds and exits the mysqladmin loop if the status is respawn. This causes the job to report a failure to start if things are truly broken, which is what we want. It will also cause respawn to give up faster if mysqld exits, but I think that is fine, given that it is a large database daemon and probably isn't going to respond well to rapid respawning.
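For illustration, the post-start change could look roughly like the sketch below. This is not the uploaded patch: the function name wait_for_mysqld, its arguments, and the 'respawn/' pattern matched against the status output are assumptions made for the example, and the ping and status commands are passed in as strings so the loop can be tried without mysqld or upstart running.

```shell
#!/bin/sh
# Hypothetical helper (not from the package):
#   wait_for_mysqld PING_CMD STATUS_CMD [TRIES] [DELAY]
# PING_CMD   command that succeeds once the server answers
# STATUS_CMD command that prints the job status line
# TRIES      number of attempts (defaults to 30)
# DELAY      seconds to sleep between attempts (defaults to 1)
wait_for_mysqld() {
    ping_cmd=$1
    status_cmd=$2
    tries=${3:-30}
    delay=${4:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # The server answered the ping: startup succeeded.
        if eval "$ping_cmd" >/dev/null 2>&1; then
            return 0
        fi
        # Assumption: the status line contains "respawn/" while the
        # job's goal is respawn. Stop waiting so upstart can respawn.
        state=$(eval "$status_cmd" 2>/dev/null)
        case $state in
            *respawn/*) return 1 ;;
        esac
        sleep "$delay"
        i=$((i + 1))
    done
    # The server never answered within the allowed attempts.
    return 1
}

# Assumed use inside the job's post-start script (connection options
# for mysqladmin omitted):
#   wait_for_mysqld "mysqladmin ping" "status" || exit 1
```

In the job file this would sit inside the post-start script, next to a 'respawn limit 2 5' stanza, so that two quick failures are enough for upstart to give up.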