juju restore-backup does not complete properly

Bug #1569467 reported by Curtis Hovey
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Ian Booth

Bug Description

As seen in this false success
    http://reports.vapour.ws/releases/3881/job/functional-backup-restore/attempt/4000
reports a success because the the controller was restored, but the errors during kill-controller reveal the hosted model was lost. So while the controller was restored, Users cannot manage the hosted models that they ran restore to get access to.

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
tags: added: blocker
Changed in juju-core:
importance: High → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Another CI run was done with --verbose which gives more output: http://juju-ci.vapour.ws:8080/view/Juju%20Revisions/job/functional-backup-restore/4002/

When the controller is bootstrapped, the default model is given the name "functional-backup-restore". However, post-restore, we see this as the models in the controller:

2016-04-12 20:58:40 DEBUG ('juju', '--show-log', 'list-models', '-c', 'functional-backup-restore', '--format', 'yaml')
2016-04-12 20:58:40 DEBUG models:
- name: admin
  model-uuid: b64b83e7-3ac2-4398-857f-ecd64bb37c71
  owner: admin@local
  last-connection: 2 seconds ago
- name: default
  model-uuid: ab08a8c9-f6fe-4ee4-8fd9-b919dba63b8b
  owner: admin@local
  last-connection: never connected
current-model: admin

I am able to backup / restore with the hosted model just fine in joyent (AWS has issues - mongo keeps getting SIGKILL, maybe my instance size is too small?)

I'm wondering if something like that is happening here where the restore doesn't actually complete.

Curtis - can we get a run that keeps the environment up on failure?

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Some other things to note:

- This is not a new failure. It just wasn't being identified previously. The last run of master which did NOT have this failure was: http://reports.vapour.ws/releases/3833

- The first run on master to have this failure was: http://reports.vapour.ws/releases/3845

- MADE-model workers was merged into master during that time, and I see that this failure occurred on that branch before it was seen in master.
    - First MADE-model-workers run with this failure: http://reports.vapour.ws/releases/3826
    - Last MADE-model-workers run without this failure: http://reports.vapour.ws/releases/3795

- William made a change to restore around this time via bug #1559712 There's no PR in the bug, and nothing in the git log that references that bug. But, maybe it's a starting point.

Revision history for this message
Ian Booth (wallyworld) wrote :

I can reproduce this on LXD. I have observed that:
- the bson dump files from the backup does contain the hosted model document
- the restore -b bootstraps a new controller and hence creates an empty hosted model called "default"
- only the admin model is restored

So any notion that the restore worked with a hosted model was sort of a red herring. If the original hosted model was called "default" then list-models in a restored system would have shown a model called "default" but it would have been empty.

Why is only the admin model restored, not sure yet. It seems that mongorestore is called and loads the entire dump so am not sure what's happening.

Revision history for this message
Ian Booth (wallyworld) wrote :

My next thought was that restore is not even working at all. So we are just ending up with the default bootstrapped admin and default hosted model. juju status confirms this. So restore is initiated:

2016-04-13 08:43:24 DEBUG juju.api.backups restore.go:61 Will attempt to call 'PrepareRestore'
2016-04-13 08:43:24 INFO juju.juju api.go:213 connecting to API addresses: [10.0.4.35:17070]

But fails. Interesting log entries:

machine-0: 2016-04-13 08:43:24 DEBUG juju.apiserver apiserver.go:305 -> [7] user-admin@local 167.444991ms {"RequestId":2,"Error":"cannot set restore status to \"PENDING\": Another status change occurred concurrently","Response":"'body redacted'"} Backups[""].PrepareRestore

So PrepareRestore is failing and hence so is the restore.

Revision history for this message
Ian Booth (wallyworld) wrote :

The code to update restore status is fundamentally broken.
func (info *RestoreInfo) SetStatus(status RestoreStatus)

This appears to have been broken maybe in a MADE related refactoring. Some thought will be needed to fix it up.

tags: removed: blocker
Changed in juju-core:
milestone: 2.0-beta4 → 2.0-rc1
summary: - backup-restore loses the hosted model
+ juju restore-backup does not complete properly
Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → nobody
William Reade (fwereade)
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
Revision history for this message
Curtis Hovey (sinzui) wrote :

From the duplicate

The assess_recovery.py script used by several tests can report success when it's actually failing late on.

<http://reports.vapour.ws/releases/3881/job/functional-backup-restore/attempt/4002>

2016-04-12 20:58:40 ERROR cmd supercommand.go:448 connecting with bootstrap config: unknown model: "7aef8efb-9fb8-4cb0-8751-a2e40e1a82fa" (not found)
2016-04-12 20:58:40 ERROR Command '('juju', '--show-log', 'show-status', '-m', 'functional-backup-restore', '--format', 'yaml')' returned non-zero exit status 1
...
Finished: SUCCESS

This is hiding breakage from the reports site.

Revision history for this message
Curtis Hovey (sinzui) wrote :

The assess-rescovery.py test needs an update to fail juju when juju looses models/instances. The the test does not provide enough information about the controllers, models, or show the machines in the admin and hosted models. we also want the recovery test to do some thing to exercise juju after juju reports it is successful.

Changed in juju-ci-tools:
status: New → Triaged
importance: Undecided → High
Changed in juju-core:
assignee: William Reade (fwereade) → Cheryl Jennings (cherylj)
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Ian Booth (wallyworld)
Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → Ian Booth (wallyworld)
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta6 → 2.0-beta7
Revision history for this message
Ian Booth (wallyworld) wrote :
Revision history for this message
Martin Packman (gz) wrote :

Unduping bug 1570000 for the cooler bug number for the ci side of the changes needed.

no longer affects: juju-ci-tools
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta7 → none
milestone: none → 2.0-beta7
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.