Oh dear, where to start.
I downloaded the massive pile of log files and started pawing through them. It seems we have multiple issues :-(
1) The machine agents check and see that there are no upgrade steps to execute, so there should be no restricted API wrapper, which means you shouldn't see "machine upgrading" API responses.
2) The API server worker appears to fail to stop under load. We are still tracking down the reason, but we think that a stuck MongoDB session pinger would prevent the worker from stopping, and there are indications in the logs that something like this is happening.
This will be the cause of the main HA nodes' machine agents not stopping in response to the upgrade request. When they are manually restarted, they appear to notice that they need to restart for the upgrade before they get "too busy".
3) Some uniter workers did not stop in response to the kill request. Still looking into this one. This will be the reason that some unit agents showed the old agent version.
I haven't yet dug into why so many of the units were left in an error state.