I have a theory as to what's happening. The CLI status is reporting the correct status, but the status via the API is wrong. The API reported status uses an all watcher backing model. That model appears to be incorrectly updated in response to some status changes, and thus reports stale data to callers like deployer. Restarting the start server(s) seems to have got around the issue, lending credence to this theory.
The issue seemed to happen when a leader election hook failed and then ran again and comes good second time. No user did a resolve --retry to reset things.
The code below is called when a status value changes. If the change is for a unit (#charm) or is for an error, the workload status is put into error. Once the error goes away status for agent is updated back to "idle" but the first if{} block is not run ever and so the unit workload status remains in error state. I think we just need some logic to say if workload state is error and new incoming agent state is not error, reset the workload status to what's currently in state. We may need to record the previous non error workload status on the backing doc to make this work.
func (s *backingStatus) updatedUnitStatus(st *State, store *multiwatcherStore, id string, newInfo *multiwatcher.UnitInfo) error {
// Unit or workload status - display the agent status or any error.
if strings.HasSuffix(id, "#charm") || s.Status == StatusError {
newInfo.WorkloadStatus.Current = multiwatcher.Status(s.Status)
newInfo.WorkloadStatus.Message = s.StatusInfo
newInfo.WorkloadStatus.Data = s.StatusData
newInfo.WorkloadStatus.Since = s.Updated
} else {
newInfo.AgentStatus.Current = multiwatcher.Status(s.Status)
newInfo.AgentStatus.Message = s.StatusInfo
newInfo.AgentStatus.Data = s.StatusData
newInfo.AgentStatus.Since = s.Updated
}
I have a theory as to what's happening. The CLI status is reporting the correct status, but the status via the API is wrong. The API reported status uses an all watcher backing model. That model appears to be incorrectly updated in response to some status changes, and thus reports stale data to callers like deployer. Restarting the start server(s) seems to have got around the issue, lending credence to this theory.
The issue seemed to happen when a leader election hook failed and then ran again and comes good second time. No user did a resolve --retry to reset things.
The code below is called when a status value changes. If the change is for a unit (#charm) or is for an error, the workload status is put into error. Once the error goes away status for agent is updated back to "idle" but the first if{} block is not run ever and so the unit workload status remains in error state. I think we just need some logic to say if workload state is error and new incoming agent state is not error, reset the workload status to what's currently in state. We may need to record the previous non error workload status on the backing doc to make this work.
func (s *backingStatus) updatedUnitStat us(st *State, store *multiwatcherStore, id string, newInfo *multiwatcher. UnitInfo) error { HasSuffix( id, "#charm") || s.Status == StatusError { WorkloadStatus. Current = multiwatcher. Status( s.Status) WorkloadStatus. Message = s.StatusInfo WorkloadStatus. Data = s.StatusData WorkloadStatus. Since = s.Updated AgentStatus. Current = multiwatcher. Status( s.Status) AgentStatus. Message = s.StatusInfo AgentStatus. Data = s.StatusData AgentStatus. Since = s.Updated
// Unit or workload status - display the agent status or any error.
if strings.
newInfo.
newInfo.
newInfo.
newInfo.
} else {
newInfo.
newInfo.
newInfo.
newInfo.
}