agent lost, even though agents are still responding
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | Critical | John A Meinel |
2.2 | Fix Released | Critical | John A Meinel |
Bug Description
With 2.2.1 we are seeing agents show up as missing (lost), even though they seem to still be responding to events (update-status still fires, changing configuration still seems to apply that configuration to the running applications.)
It is possible that some of our recent changes around how we aggregate and record Pings have caused liveness information to get lost.
Debugging the presence tables (with the attached python script) has shown a bunch of agents without pings in the recent ping slots (as many as 7 slots without any ticks, which would be 3.5 minutes without presence information for those agents).
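For reference, here is a minimal sketch of the slot arithmetic that kind of check relies on. This is not the juju/state/presence code or the attached script; the 30-second slot width matches the "7 slots is 3.5 minutes" figure above, but the alive-bitmask layout shown is an assumption about the schema, and the helper names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// slotPeriod is the presence slot width in seconds; 7 empty slots at 30s
// each is the 3.5 minutes of missing presence described above.
const slotPeriod = 30

// slotFor rounds a unix timestamp down to the start of its presence slot.
func slotFor(unixTime int64) int64 {
	return unixTime - (unixTime % slotPeriod)
}

// aliveField returns the map key and bit a pinger with the given sequence
// number would set in a ping document's "alive" map. ASSUMPTION: the key is
// derived from seq/63 and the bit from seq%63; adjust to the real schema
// before using this against a live controller.
func aliveField(seq int64) (key string, bit uint64) {
	return fmt.Sprintf("%d", seq/63), 1 << uint64(seq%63)
}

func main() {
	now := time.Now().Unix()
	// The 7 most recent slots: an agent whose bit is absent from all of
	// them has had no recorded presence for roughly 3.5 minutes.
	for i := int64(0); i < 7; i++ {
		fmt.Println("slot:", slotFor(now)-i*slotPeriod)
	}
	key, bit := aliveField(1234)
	fmt.Printf("seq 1234 -> alive.%s, bit %#x\n", key, bit)
}
```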
We are not seeing Ping entries that correspond to beings that are no longer in the database, so it is unlikely that pruning is removing active beings. (2.2.2 already has a patch for this, so that instead of only preserving the beings for the most recent 2 slots, we preserve beings that are active in any slot still in the database.)
It seems to be necessary to have HA controllers in order to trigger this behavior.
At this point it is unclear whether the agents have stopped sending Ping() messages, or whether we are just failing to record them.
It is possible we have a bug where a PingBatcher will get wedged trying to write to the database, and then still be accepting Pings and not getting restarted, but just never writes any more presence information. (And thus we lose the presence for all agents that are connected to that specific batcher.)
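As a toy illustration of that suspected wedge (deliberately not the real PingBatcher; the names and batching details are made up to show the failure shape only): pings keep being accepted onto the batcher's channel while the goroutine that flushes them is stuck in a database write, so nothing new ever reaches the presence collection, and eventually callers block or time out as well.

```go
package main

import (
	"fmt"
	"time"
)

// toyBatcher stands in for the suspected wedged PingBatcher.
type toyBatcher struct {
	pings chan string // incoming pings, one entry per agent (illustrative)
}

// Ping queues a ping. It keeps succeeding while the channel has room,
// even if the flush loop below never drains the channel again.
func (b *toyBatcher) Ping(agent string) {
	b.pings <- agent
}

// loop batches pings and writes them out. If writeBatch blocks forever
// (standing in for a stuck mongo write), the batcher is wedged: it looks
// alive to callers, but no presence is ever recorded, and once the channel
// fills up, Ping calls start blocking too.
func (b *toyBatcher) loop(writeBatch func(batch []string) error) {
	var batch []string
	for agent := range b.pings {
		batch = append(batch, agent)
		if len(batch) >= 4 {
			if err := writeBatch(batch); err != nil {
				return
			}
			batch = batch[:0]
		}
	}
}

func main() {
	b := &toyBatcher{pings: make(chan string, 8)}
	// A flush function that never returns, like a hung database write.
	go b.loop(func(batch []string) error {
		for {
			time.Sleep(time.Hour)
		}
	})
	for i := 0; i < 8; i++ {
		b.Ping(fmt.Sprintf("unit-%d", i)) // still appears to work
	}
	fmt.Println("8 pings accepted; none will ever be written")
}
```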
We haven't yet reproduced this in isolation.
Changed in juju:
status: Fix Committed → Fix Released
So I managed to trigger a failure that we don't recover from. I don't know if this is the problem that is happening in the wild, but it is *a* problem, which we should fix.
If you inject an invalid entry into the database to force the PingBatcher to die:

db.presence.pings.insert({"_id": "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870", "slot": NumberLong(1499793870), "alive": {"1": "a"}})

(note the slot was picked as 60 greater than the biggest slot already present)

Then PingBatcher will fail with an error like:

machine-0: 21:24:32 ERROR juju.worker exited "pingbatcher": Cannot apply $inc to a value of non-numeric type. {_id: "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870"} has the field '1' of non-numeric type String
At that point, we actually *restart* PingBatcher, but all of the existing Pinger objects continue to use the now-dead PingBatcher, so they all actually end up blocked/timing out.
Now, this is overly forceful, as it will cause all PingBatchers on all controllers to die. But imagine that one PingBatcher was dying on one controller; that would produce a similar symptom.
The issue is that we construct Pingers passing in a PingBatcher to use, but we don't have them switch to an updated PingBatcher if for any reason we need to restart the PingBatcher.
Instead, we need to give them a function that returns whatever PingBatcher is currently live.
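A rough sketch of the shape that fix could take (illustrative types only, not the actual juju/state/presence API): rather than capturing a *PingBatcher when the Pinger is constructed, hand the Pinger a getter and look up the current batcher on every ping, so a restarted PingBatcher is picked up transparently.

```go
package presence

// PingBatcher here is a stand-in for the real batcher; only the shape matters.
type PingBatcher struct{}

// Ping records one agent ping (signature is illustrative).
func (b *PingBatcher) Ping(modelUUID string, slot int64, fieldKey string, fieldBit uint64) error {
	return nil
}

// Before: the batcher is captured once at construction and kept forever,
// so a Pinger keeps talking to a dead batcher after a restart.
type pingerBefore struct {
	batcher *PingBatcher
}

// After: the Pinger holds a function that returns whichever PingBatcher is
// currently alive, so restarting the batcher does not strand the Pingers.
type pingerAfter struct {
	getBatcher func() *PingBatcher
}

func (p *pingerAfter) ping(modelUUID string, slot int64, key string, bit uint64) error {
	// A freshly restarted PingBatcher is used here without the Pinger
	// needing to be rebuilt.
	return p.getBatcher().Ping(modelUUID, slot, key, bit)
}
```

The getter could be as simple as a closure over whatever worker owns and restarts the current batcher.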
(I'm also not sure what happens if a Pinger actually dies due to an error, as near as I can tell we don't ever restart them, either.)