Canonical Juju

juju restore-backup does not complete properly

Bug #1569467 reported by Curtis Hovey on 2016-04-12

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	Ian Booth	Canonical Juju 2.0-beta7

Bug Description

As seen in this false success
http://reports.vapour.ws/releases/3881/job/functional-backup-restore/attempt/4000
reports a success because the the controller was restored, but the errors during kill-controller reveal the hosted model was lost. So while the controller was restored, Users cannot manage the hosted models that they ran restore to get access to.

Tags:

Cheryl Jennings (cherylj) on 2016-04-12

Changed in juju-core:
assignee:	nobody → Cheryl Jennings (cherylj)
tags:	added: blocker

Cheryl Jennings (cherylj) on 2016-04-13

Changed in juju-core:
importance:	High → Critical

Revision history for this message

Cheryl Jennings (cherylj) wrote on 2016-04-13:

Another CI run was done with --verbose which gives more output: http://juju-ci.vapour.ws:8080/view/Juju%20Revisions/job/functional-backup-restore/4002/

When the controller is bootstrapped, the default model is given the name "functional-backup-restore". However, post-restore, we see this as the models in the controller:

2016-04-12 20:58:40 DEBUG ('juju', '--show-log', 'list-models', '-c', 'functional-backup-restore', '--format', 'yaml')
2016-04-12 20:58:40 DEBUG models:
- name: admin
  model-uuid: b64b83e7-3ac2-4398-857f-ecd64bb37c71
  owner: admin@local
  last-connection: 2 seconds ago
- name: default
  model-uuid: ab08a8c9-f6fe-4ee4-8fd9-b919dba63b8b
  owner: admin@local
  last-connection: never connected
current-model: admin

I am able to backup / restore with the hosted model just fine in joyent (AWS has issues - mongo keeps getting SIGKILL, maybe my instance size is too small?)

I'm wondering if something like that is happening here where the restore doesn't actually complete.

Curtis - can we get a run that keeps the environment up on failure?

Revision history for this message

Cheryl Jennings (cherylj) wrote on 2016-04-13:

Some other things to note:

- This is not a new failure. It just wasn't being identified previously. The last run of master which did NOT have this failure was: http://reports.vapour.ws/releases/3833

- The first run on master to have this failure was: http://reports.vapour.ws/releases/3845

- MADE-model workers was merged into master during that time, and I see that this failure occurred on that branch before it was seen in master.
- First MADE-model-workers run with this failure: http://reports.vapour.ws/releases/3826
- Last MADE-model-workers run without this failure: http://reports.vapour.ws/releases/3795

- William made a change to restore around this time via bug #1559712 There's no PR in the bug, and nothing in the git log that references that bug. But, maybe it's a starting point.

Revision history for this message

Ian Booth (wallyworld) wrote on 2016-04-13:

I can reproduce this on LXD. I have observed that:
- the bson dump files from the backup does contain the hosted model document
- the restore -b bootstraps a new controller and hence creates an empty hosted model called "default"
- only the admin model is restored

So any notion that the restore worked with a hosted model was sort of a red herring. If the original hosted model was called "default" then list-models in a restored system would have shown a model called "default" but it would have been empty.

Why is only the admin model restored, not sure yet. It seems that mongorestore is called and loads the entire dump so am not sure what's happening.

Revision history for this message

Ian Booth (wallyworld) wrote on 2016-04-13:

My next thought was that restore is not even working at all. So we are just ending up with the default bootstrapped admin and default hosted model. juju status confirms this. So restore is initiated:

2016-04-13 08:43:24 DEBUG juju.api.backups restore.go:61 Will attempt to call 'PrepareRestore'
2016-04-13 08:43:24 INFO juju.juju api.go:213 connecting to API addresses: [10.0.4.35:17070]

But fails. Interesting log entries:

machine-0: 2016-04-13 08:43:24 DEBUG juju.apiserver apiserver.go:305 -> [7] user-admin@local 167.444991ms {"RequestId":2,"Error":"cannot set restore status to \"PENDING\": Another status change occurred concurrently","Response":"'body redacted'"} Backups[""].PrepareRestore

So PrepareRestore is failing and hence so is the restore.

Revision history for this message

Ian Booth (wallyworld) wrote on 2016-04-13:

The code to update restore status is fundamentally broken.
func (info *RestoreInfo) SetStatus(status RestoreStatus)

This appears to have been broken maybe in a MADE related refactoring. Some thought will be needed to fix it up.

Cheryl Jennings (cherylj) on 2016-04-13

tags:	removed: blocker
Changed in juju-core:
milestone:	2.0-beta4 → 2.0-rc1
summary:	- backup-restore loses the hosted model + juju restore-backup does not complete properly

Cheryl Jennings (cherylj) on 2016-04-14

Changed in juju-core:
assignee:	Cheryl Jennings (cherylj) → nobody

William Reade (fwereade) on 2016-04-19

Changed in juju-core:
assignee:	nobody → William Reade (fwereade)

Revision history for this message

Curtis Hovey (sinzui) wrote on 2016-04-20:

From the duplicate

The assess_recovery.py script used by several tests can report success when it's actually failing late on.

<http://reports.vapour.ws/releases/3881/job/functional-backup-restore/attempt/4002>

2016-04-12 20:58:40 ERROR cmd supercommand.go:448 connecting with bootstrap config: unknown model: "7aef8efb-9fb8-4cb0-8751-a2e40e1a82fa" (not found)
2016-04-12 20:58:40 ERROR Command '('juju', '--show-log', 'show-status', '-m', 'functional-backup-restore', '--format', 'yaml')' returned non-zero exit status 1
...
Finished: SUCCESS

This is hiding breakage from the reports site.

Revision history for this message

Curtis Hovey (sinzui) wrote on 2016-04-20:

The assess-rescovery.py test needs an update to fail juju when juju looses models/instances. The the test does not provide enough information about the controllers, models, or show the machines in the admin and hosted models. we also want the recovery test to do some thing to exercise juju after juju reports it is successful.

Changed in juju-ci-tools:
status:	New → Triaged
importance:	Undecided → High

Cheryl Jennings (cherylj) on 2016-04-20

Changed in juju-core:
assignee:	William Reade (fwereade) → Cheryl Jennings (cherylj)

Curtis Hovey (sinzui) on 2016-04-20

Changed in juju-core:
milestone:	2.0-beta5 → 2.0-rc1

Ian Booth (wallyworld) on 2016-04-22

Changed in juju-core:
assignee:	Cheryl Jennings (cherylj) → Ian Booth (wallyworld)
status:	Triaged → In Progress

Curtis Hovey (sinzui) on 2016-04-25

Changed in juju-core:
milestone:	2.0-beta6 → 2.0-beta7

Revision history for this message

Ian Booth (wallyworld) wrote on 2016-04-27:

PR: https://github.com/juju/juju/pull/5292

Revision history for this message

Martin Packman (gz) wrote on 2016-04-28:

Unduping bug 1570000 for the cooler bug number for the ci side of the changes needed.

no longer affects:

juju-ci-tools

Ian Booth (wallyworld) on 2016-04-29

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2016-05-13

Changed in juju-core:
status:	Fix Committed → Fix Released

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju
Changed in juju:
milestone:	2.0-beta7 → none
milestone:	none → 2.0-beta7

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1559715

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.