inconsistent juju status from cli vs api

Bug #1467690 reported by JuanJo Ciarlante
This bug affects 3 people
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   High         Ian Booth
1.24        Fix Released   Critical     Ian Booth

Bug Description

juju version: 1.24.0-trusty-amd64 (agent-version: 1.24.0.1)

We have an environment showing a clean 'juju status': no
'error' string appears in its output; excerpt for the failing
'landscape-client/1' subordinate unit:
http://paste.ubuntu.com/11759456/
However, fetching status via the API with a python script [0]
shows the unit in error with:
  agent-state-info: 'hook failed: "leader-elected"'
http://paste.ubuntu.com/11759469/

[0] http://paste.ubuntu.com/11759446/

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

By the way, we found this issue because juju-deployer failed as per the above.
Note that trying to resolve the unit also fails:
$ juju resolved -r landscape-client/1
ERROR unit "landscape-client/1" is not in an error state

i.e. this issue leaves the environment inoperable via juju-deployer.

Ian Booth (wallyworld)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
milestone: none → 1.25.0
Revision history for this message
Ian Booth (wallyworld) wrote :

I have a theory as to what's happening. The CLI is reporting the correct status, but the status via the API is wrong. The API-reported status uses an all-watcher backing model. That model appears to be incorrectly updated in response to some status changes, and thus reports stale data to callers like the deployer. Restarting the state server(s) seems to have worked around the issue, lending credence to this theory.

The issue seemed to happen when a leader election hook failed, then ran again and came good the second time. No user ran resolved --retry to reset things.

The code below is called when a status value changes. If the change is for the unit's charm ("#charm") or the status is an error, the workload status is updated, which is how it gets put into error. Once the error goes away, the agent status is updated back to "idle", but the first if{} block never runs again, so the unit's workload status remains in the error state. I think we just need some logic saying: if the workload status is error and the new incoming agent status is not an error, reset the workload status to what's currently in state. We may need to record the previous non-error workload status on the backing doc to make this work.

func (s *backingStatus) updatedUnitStatus(st *State, store *multiwatcherStore, id string, newInfo *multiwatcher.UnitInfo) error {
    // Unit or workload status - display the agent status or any error.
    if strings.HasSuffix(id, "#charm") || s.Status == StatusError {
        newInfo.WorkloadStatus.Current = multiwatcher.Status(s.Status)
        newInfo.WorkloadStatus.Message = s.StatusInfo
        newInfo.WorkloadStatus.Data = s.StatusData
        newInfo.WorkloadStatus.Since = s.Updated
    } else {
        newInfo.AgentStatus.Current = multiwatcher.Status(s.Status)
        newInfo.AgentStatus.Message = s.StatusInfo
        newInfo.AgentStatus.Data = s.StatusData
        newInfo.AgentStatus.Since = s.Updated
    }
    // ... (rest of function omitted)
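
A minimal sketch of the reset described above, assuming a hypothetical helper
unitWorkloadStatusFromState(st, id) that re-reads the unit's persisted workload
status (with Status, Message and Data fields); the fix as actually committed may
differ:

    // To be added to the else branch above, which is taken when the incoming
    // change is an agent status that is not an error. If the workload is still
    // showing a stale hook error, re-read the workload status from state and
    // use that instead of the cached error.
    if newInfo.WorkloadStatus.Current == multiwatcher.Status(StatusError) {
        current, err := unitWorkloadStatusFromState(st, id) // hypothetical lookup
        if err != nil {
            return err
        }
        newInfo.WorkloadStatus.Current = multiwatcher.Status(current.Status)
        newInfo.WorkloadStatus.Message = current.Message
        newInfo.WorkloadStatus.Data = current.Data
        newInfo.WorkloadStatus.Since = s.Updated
    }

This follows the suggestion of falling back to what's currently recorded in state
once the agent comes out of error; recording the previous non-error workload status
on the backing doc would achieve the same without an extra read from state.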

Revision history for this message
Brad Marshall (brad-marshall) wrote :

As per request from Ian, attaching a mongo dump of the statuses collection from the state server.

Revision history for this message
Brad Marshall (brad-marshall) wrote :

And now here's the statuseshistory collection.

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released